passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	flow, tcp: Flow based NAT and port forwarding for TCP	David Gibson	2024-07-19	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()	David Gibson	2024-02-29	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_listen_handler() uses the epoll reference for the listening socket it handles, and also passes on one variant of it to tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter two functions only need a couple of specific fields from the reference. Pass those specific values instead of the whole reference, which localises the handling of the listening (as opposed to accepted) socket and its reference entirely within tcp_listen_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: More specific variable names in new splice path	David Gibson	2024-02-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_splice_conn_from_sock(), the 'port' variable stores the source port of the connection on the originating side. In tcp_splice_new(), called directly from it, the 'port' parameter gives the _destination_ port of the originating connection and is then updated to the destination port of the connection on the other side. Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the accetped socket (on side 0), whereas in tcp_splice_new(), 's' is the fd of the connecting socket (side 1). I, for one, find having the same variable name with different meanings in such close proximity in the flow of control pretty confusing. Alter the names for greater specificity and clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify flow entry life cycle, introduce uniform logging	David Gibson	2024-02-29	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our allocation scheme for flow entries means there are some non-obvious constraints on when what things can be done with an entry. Add a big doc comment explaining the life cycle. In addition, make a FLOW_START() macro to mark one of the important transitions. This encourages correct usage, by making it natural to only access the flow type specific structure after calling it. It also logs that a new flow has been created, which is useful for debugging. We also add logging when a flow's lifecycle ends. This doesn't need a new helper, because it can only happen either from flow_alloc_cancel() or from the flow deferred handler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Introduce union sockaddr_inany	David Gibson	2024-02-29	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are a number of places where we want to handle either a sockaddr_in or a sockaddr_in6. In some of those we use a void *, which works ok and matches some standard library interfaces, but doesn't give a signature level hint that we're dealing with only sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of socket address. Other places we use a sockaddr_storage, which also works, but has the same problem in addition to allocating more on the stack than we need to. Introduce union sockaddr_inany to explictly handle this case: it has variants for sockaddr_in and sockaddr_in6. Use it in a number of places where it's easy to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets	David Gibson	2024-01-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	Currently connected TCP sockets have the same epoll type, whether they're for a "tap" connection or a spliced connection. This means that tcp_sock_handler() has to do a secondary check on the type of the connection to call the right function. We can avoid this by adding a new epoll type and dispatching directly to the right thing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow,tcp: Use epoll_ref type including flow and side	David Gibson	2023-12-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently TCP uses the 'flow' epoll_ref field for both connected sockets and timers, which consists of just the index of the relevant flow (connection). This is just fine for timers, for while it obviously works, it's subtly incomplete for sockets on spliced connections. In that case we want to know which side of the connection the event is occurring on as well as which connection. At present, we deduce that information by looking at the actual fd, and comparing it to the fds of the sockets on each side. When we use the flow table for more things, we expect more cases where something will need to know a specific side of a specific flow for an event, but nothing more. Therefore add a new 'flowside' epoll_ref field, with exactly that information. We use it for TCP connected sockets. This allows us to directly know the side for spliced connections. For "tap" connections, it's pretty meaningless, since the side is always the socket side. It still makes logical sense though, and it may become important for future flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-1/+2
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of listening TCP sockets into their own handler	David Gibson	2023-08-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove splice from tcp_epoll_ref	David Gibson	2022-11-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Currently the epoll reference for tcp sockets includes a bit indicating whether the socket maps to a spliced connection. However, the reference also has the index of the connection structure which also indicates whether it is spliced. We can therefore avoid the splice bit in the epoll_ref by unifying the first part of the non-spliced and spliced handlers where we look up the connection state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use the same sockets to listen for spliced and non-spliced connections	David Gibson	2022-11-25	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, tcp_sock_init[46]() create separate sockets to listen for spliced connections (these are bound to localhost) and non-spliced connections (these are bound to the host address). This introduces a subtle behavioural difference between pasta and passt: by default, pasta will listen only on a single host address, whereas passt will listen on all addresses (0.0.0.0 or ::). This also prevents us using some additional optimizations that only work with the unspecified (0.0.0.0 or ::) address. However, it turns out we don't need to do this. We can splice a connection if and only if it originates from the loopback address. Currently we ensure this by having the "spliced" listening sockets listening only on loopback. Instead, defer the decision about whether to splice a connection until after accept(), by checking if the connection was made from the loopback address. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify part of spliced and non-spliced conn_from_sock path	David Gibson	2022-11-25	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_sock_handler() we split off to handle spliced sockets before checking anything else. However the first steps of the "new connection" path for each case are the same: allocate a connection entry and accept() the connection. Remove this duplication by making tcp_conn_from_sock() handle both spliced and non-spliced cases, with help from more specific tcp_tap_conn_from_sock and tcp_splice_conn_from_sock functions for the later stages which differ. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Partially unify tcp_timer() and tcp_splice_timer()	David Gibson	2022-11-25	1	-1/+0
\| \| \| \| \| \| \| \| \| \|	These two functions scan all the non-splced and spliced connections respectively and perform timed updates on them. Avoid scanning the now unified table twice, by having tcp_timer scan it once calling the relevant per-connection function for each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()	David Gibson	2022-11-25	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \|	These two functions each step through non-spliced and spliced connections respectively and clean up entries for closed connections. To avoid scanning the connection table twice, we merge these into a single function which scans the unified table and performs the appropriate sort of cleanup action on each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify spliced and non-spliced connection tables	David Gibson	2022-11-25	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently spliced and non-spliced connections are stored in completely separate tables, so there are completely independent limits on the number of spliced and non-spliced connections. This is a bit counter-intuitive. More importantly, the fact that the tables are separate prevents us from unifying some other logic between the two cases. So, merge these two tables into one, using the 'c.spliced' common field to distinguish between them when necessary. For now we keep a common limit of 128k connections, whether they're spliced or non-spliced, which means we save memory overall. If necessary we could increase this to a 256k or higher total, which would cost memory but give some more flexibility. For now, the code paths which need to step through all extant connections are still separate for the two cases, just skipping over entries which aren't for them. We'll improve that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: #include tcp_splice.h in tcp_splice.c	David Gibson	2022-11-25	1	-3/+0
\| \| \| \| \| \| \| \| \|	This obvious include was omitted, which means that declarations in the header weren't checked against definitions in the .c file. This shows up an old declaration for a function that is now static, and a duplicate Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Add include guards	Stefano Brivio	2022-03-29	1	-0/+5
\| \| \| \| \| \| \|	...at the moment, just for consistency with packet.h, icmp.h, tcp.h and udp.h. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Packet abstraction with mandatory boundary checks	Stefano Brivio	2022-03-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Close sockets right away on high number of open files	Stefano Brivio	2022-03-29	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	We can't take for granted that the hard limit for open files is big enough as to allow to delay closing sockets to a timer. Store the value of RTLIMIT_NOFILE we set at start, and use it to understand if we're approaching the limit with pending, spliced TCP connections. If that's the case, close sockets right away as soon as they're not needed, instead of deferring this task to a timer. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Refactor to use events instead of states, split out spliced implementation	Stefano Brivio	2022-03-28	1	-0/+14
	Using events and flags instead of states makes the implementation much more straightforward: actions are mostly centered on events that occurred on the connection rather than states. An example is given by the ESTABLISHED_SOCK_FIN_SENT and FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about which side started closing the connection to handle closing of connection halves. Split out the spliced implementation, as it has very little in common with the "regular" TCP path. Refactor things here and there to improve clarity. Add helpers to trace where resets and flag settings come from. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>