passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Improve handling of fallback if socket pool is empty on new splice	David Gibson	2023-02-14	1	-59/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When creating a new spliced connection, we need to get a socket in the other ns from the originating one. To avoid excessive ns switches we usually get these from a pool refilled on a timer. However, if the pool runs out we need a fallback. Currently that's done by passing -1 as the socket to tcp_splice_connnect() and running it in the target ns. This means that tcp_splice_connect() itself needs to have different cases depending on whether it's given an existing socket or not, which is a separate concern from what it's mostly doing. We change it to require a suitable open socket to be passed in, and ensuring in the caller that we have one. This requires adding the fallback paths to the caller, tcp_splice_new(). We use slightly different approaches for a socket in the init ns versus the guest ns. This also means that we no longer need to run tcp_splice_connect() itself in the guest ns, which allows us to remove a bunch of boilerplate code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split pool lookup from creating new sockets in tcp_conn_new_sock()	David Gibson	2023-02-14	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if that's empty creates a new socket in the init namespace. Both parts of this are duplicated in other places: the pool lookup logic is duplicated in tcp_splice_new(), and the socket opening logic is duplicated in tcp_sock_refill_pool(). Split the function into separate parts so we can remove both these duplications. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move socket pool declarations around	David Gibson	2023-02-14	1	-7/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice.c has some explicit extern declarations to access the socket pools. This is pretty dangerous - if we changed the type of these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the same memory in different ways with no compiler error. So, move the extern declarations to tcp_conn.h so they're visible to both tcp.c and tcp_splice.c, but not the rest of pasta. In fact the pools for the guest namespace are necessarily only used by tcp_splice.c - we have no sockets on the guest side if we're not splicing. So move those declarations and the functions that deal exclusively with them to tcp_splice.c Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Make assertions actually useful	David Gibson	2023-02-12	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Pass union tcp_conn pointer to destroy and splice timer functions	Stefano Brivio	2022-11-25	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \|	The pointers are actually the same, but we later pass the container union to tcp_table_compact(), which might zero the size of the whole union, and this confuses Coverity Scan. Given that we have pointers to the container union to start with, just pass those instead, all the way down to tcp_table_compact(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp_splice: Allow splicing of connections from IPv4-mapped loopback	David Gibson	2022-11-25	1	-8/+11
\| \| \| \| \| \| \| \| \| \| \|	For non-spliced connections we now treat IPv4-mapped IPv6 addresses the same as the corresponding IPv4 addresses. However currently we won't splice a connection from ::ffff:127.0.0.1 the way we would one from 127.0.0.1. Correct this so that we can splice connections from IPv4 localhost that have been received on an IPv6 dual stack socket. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove v6 flag from tcp_epoll_ref	David Gibson	2022-11-25	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bit in the TCP specific epoll reference indicates whether the connection is IPv6 or IPv4. However the sites which refer to it are already calling accept() which (optionally) returns an address for the remote end of the connection. We can use the sa_family field in that address to determine the connection type independent of the epoll reference. This does have a cost: for the spliced case, it means we now need to get that address from accept() which introduces an extran copy_to_user(). However, in future we want to allow handling IPv4 connectons through IPv6 sockets, which means we won't be able to determine the IP version at the time we create the listening socket and epoll reference. So, at some point we'll have to pay this cost anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Helper functions for handling addresses which could be IPv4 or IPv6	David Gibson	2022-11-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	struct tcp_conn stores an address which could be IPv6 or IPv4 using a union. We can do this without an additional tag by encoding IPv4 addresses as IPv4-mapped IPv6 addresses. This approach is useful wider than the specific place in tcp_conn, so expose a new 'union inany_addr' like this from a new inany.h. Along with that create a number of helper functions to make working with these "inany" addresses easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove splice from tcp_epoll_ref	David Gibson	2022-11-25	1	-16/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Currently the epoll reference for tcp sockets includes a bit indicating whether the socket maps to a spliced connection. However, the reference also has the index of the connection structure which also indicates whether it is spliced. We can therefore avoid the splice bit in the epoll_ref by unifying the first part of the non-spliced and spliced handlers where we look up the connection state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use the same sockets to listen for spliced and non-spliced connections	David Gibson	2022-11-25	1	-4/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, tcp_sock_init[46]() create separate sockets to listen for spliced connections (these are bound to localhost) and non-spliced connections (these are bound to the host address). This introduces a subtle behavioural difference between pasta and passt: by default, pasta will listen only on a single host address, whereas passt will listen on all addresses (0.0.0.0 or ::). This also prevents us using some additional optimizations that only work with the unspecified (0.0.0.0 or ::) address. However, it turns out we don't need to do this. We can splice a connection if and only if it originates from the loopback address. Currently we ensure this by having the "spliced" listening sockets listening only on loopback. Instead, defer the decision about whether to splice a connection until after accept(), by checking if the connection was made from the loopback address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify part of spliced and non-spliced conn_from_sock path	David Gibson	2022-11-25	1	-27/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_sock_handler() we split off to handle spliced sockets before checking anything else. However the first steps of the "new connection" path for each case are the same: allocate a connection entry and accept() the connection. Remove this duplication by making tcp_conn_from_sock() handle both spliced and non-spliced cases, with help from more specific tcp_tap_conn_from_sock and tcp_splice_conn_from_sock functions for the later stages which differ. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify the IN_EPOLL flag	David Gibson	2022-11-25	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	There is very little common between the tcp_tap_conn and tcp_splice_conn structures. However, both do have an IN_EPOLL flag which has the same meaning in each case, though it's stored in a different location. Simplify things slightly by moving this bit into the common header of the two structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Partially unify tcp_timer() and tcp_splice_timer()	David Gibson	2022-11-25	1	-32/+25
\| \| \| \| \| \| \| \| \| \|	These two functions scan all the non-splced and spliced connections respectively and perform timed updates on them. Avoid scanning the now unified table twice, by having tcp_timer scan it once calling the relevant per-connection function for each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()	David Gibson	2022-11-25	1	-23/+1
\| \| \| \| \| \| \| \| \| \| \|	These two functions each step through non-spliced and spliced connections respectively and clean up entries for closed connections. To avoid scanning the connection table twice, we merge these into a single function which scans the unified table and performs the appropriate sort of cleanup action on each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify spliced and non-spliced connection tables	David Gibson	2022-11-25	1	-48/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently spliced and non-spliced connections are stored in completely separate tables, so there are completely independent limits on the number of spliced and non-spliced connections. This is a bit counter-intuitive. More importantly, the fact that the tables are separate prevents us from unifying some other logic between the two cases. So, merge these two tables into one, using the 'c.spliced' common field to distinguish between them when necessary. For now we keep a common limit of 128k connections, whether they're spliced or non-spliced, which means we save memory overall. If necessary we could increase this to a 256k or higher total, which would cost memory but give some more flexibility. For now, the code paths which need to step through all extant connections are still separate for the two cases, just skipping over entries which aren't for them. We'll improve that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Improved helpers to update connections after moving	David Gibson	2022-11-25	1	-3/+14
\| \| \| \| \| \| \| \| \| \|	When we compact the connection tables (both spliced and non-spliced) we need to move entries from one slot to another. That requires some updates in the entries themselves. Add helpers to make all the necessary updates for the spliced and non-spliced cases. This will simplify later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Add connection union type	David Gibson	2022-11-25	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, the tables for spliced and non-spliced connections are entirely separate, with different types in different arrays. We want to unify them. As a first step, create a union type which can represent either a spliced or non-spliced connection. For them to be distinguishable, the individual types need to have a common header added, with a bit indicating which type this structure is. This comes at the cost of increasing the size of tcp_tap_conn to over one (64 byte) cacheline. This isn't ideal, but it makes things simpler for now and we'll re-optimize this later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move connection state structures into a shared header	David Gibson	2022-11-25	1	-68/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently spliced and non-spliced connections use completely independent tracking structures. We want to unify these, so as a preliminary step move the definitions for both variants into a new tcp_conn.h header, shared by tcp.c and tcp_splice.c. This requires renaming some #defines with the same name but different meanings between the two cases. In the process we correct some places that are slightly out of sync between the comments and the code for various event bit names. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Helpers for converting from index to/from tcp_splice_conn	David Gibson	2022-11-25	1	-18/+25
\| \| \| \| \| \| \| \| \| \| \|	Like we already have for non-spliced connections, create a CONN_IDX() macro for looking up the index of spliced connection structures. Change the name of the array of spliced connections to be different from that for non-spliced connections (even though they're in different modules). This will make subsequent changes a bit safer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: #include tcp_splice.h in tcp_splice.c	David Gibson	2022-11-25	1	-1/+1
\| \| \| \| \| \| \| \| \|	This obvious include was omitted, which means that declarations in the header weren't checked against definitions in the .c file. This shows up an old declaration for a function that is now static, and a duplicate Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Fix port remapping for inbound, spliced connections	Stefano Brivio	2022-10-15	1	-7/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, when we receive a new inbound connection, we need to select a socket that was created in the namespace to proceed and connect() it to its final destination. The existing condition might pick a wrong socket, though, if the destination port is remapped, because we'll check the bitmap of inbound ports using the remapped port (stored in the epoll reference) as index, and not the original port. Instead of using the port bitmap for this purpose, store this information in the epoll reference itself, by adding a new 'outbound' bit, that's set if the listening socket was created the namespace, and unset otherwise. Then, use this bit to pick a socket on the right side. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 33482d5bf293 ("passt: Add PASTA mode, major rework") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, tcp_splice: Adjust comments to current meaning of inbound and outbound	Stefano Brivio	2022-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For tcp_sock_init_ns(), "inbound" connections used to be the ones being established toward any listening socket we create, as opposed to sockets we connect(). Similarly, tcp_splice_new() used to handle "inbound" connections in the sense that they originated from listening sockets, and they would in turn cause a connect() on an "outbound" socket. Since commit 1128fa03fe73 ("Improve types and names for port forwarding configuration"), though, inbound connections are more broadly defined as the ones directed to guest or namepsace, and outbound the ones originating from there. Update comments for those two functions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Move logging functions to a new file, log.c	Stefano Brivio	2022-10-14	1	-0/+1
\| \| \| \| \| \| \| \|	Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Consolidate port forwarding configuration into a common structure	David Gibson	2022-09-24	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The configuration for how to forward ports in and out of the guest/ns is divided between several different variables. For each connect direction and protocol we have a mode in the udp/tcp context structure, a bitmap of which ports to forward also in the context structure and an array of deltas to apply if the outward facing and inward facing port numbers are different. This last is a separate global variable, rather than being in the context structure, for no particular reason. UDP also requires an additional array which has the reverse mapping used for return packets. Consolidate these into a re-used substructure in the context structure. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp_splice: Allow up to 8 MiB as pipe size	Stefano Brivio	2022-04-07	1	-1/+1
\| \| \| \| \| \|	It actually improves throughput a bit, if allowed by user limits. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: False "Negative array index read" positives, CWE-129	Stefano Brivio	2022-04-07	1	-8/+16
\| \| \| \| \| \|	A flag or event bit is always set by callers. Reported by Coverity. Signed-by-off: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Logically dead code, CWE-561	Stefano Brivio	2022-04-07	1	-7/+1
\| \| \| \| \| \|	Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Unchecked return value from library, CWE-252	Stefano Brivio	2022-04-07	1	-13/+40
\| \| \| \| \| \| \|	All instances were harmless, but it might be useful to have some debug messages here and there. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Invalid type in argument to printf format specifier, CWE-686	Stefano Brivio	2022-04-05	1	-7/+7
\| \| \| \| \| \|	Harmless except for two bad debugging prints. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp, udp, icmp: Cut down on some oversized buffers	Stefano Brivio	2022-03-29	1	-5/+5
\| \| \| \| \| \| \| \| \|	The existing sizes provide no measurable differences in throughput and packet rates at this point. They were probably needed as batched implementations were not complete, but they can be decreased quite a bit now. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks	Stefano Brivio	2022-03-29	1	-6/+5
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Mark constant references as const	Stefano Brivio	2022-03-29	1	-9/+11
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Packet abstraction with mandatory boundary checks	Stefano Brivio	2022-03-29	1	-105/+111
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Use less awkward syntax to swap in/out sockets from pools	Stefano Brivio	2022-03-29	1	-7/+6
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Close sockets right away on high number of open files	Stefano Brivio	2022-03-29	1	-6/+22
\| \| \| \| \| \| \| \| \| \| \| \| \|	We can't take for granted that the hard limit for open files is big enough as to allow to delay closing sockets to a timer. Store the value of RTLIMIT_NOFILE we set at start, and use it to understand if we're approaching the limit with pending, spliced TCP connections. If that's the case, close sockets right away as soon as they're not needed, instead of deferring this task to a timer. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Enforce 24-bit limit on socket numbers	Stefano Brivio	2022-03-29	1	-0/+8
\| \| \| \| \| \| \|	This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Refactor to use events instead of states, split out spliced implementation	Stefano Brivio	2022-03-28	1	-0/+859
	Using events and flags instead of states makes the implementation much more straightforward: actions are mostly centered on events that occurred on the connection rather than states. An example is given by the ESTABLISHED_SOCK_FIN_SENT and FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about which side started closing the connection to handle closing of connection halves. Split out the spliced implementation, as it has very little in common with the "regular" TCP path. Refactor things here and there to improve clarity. Add helpers to trace where resets and flag settings come from. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>