passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	pasta: Strip RTA_PREFSRC when copying routes to the namespace2023_08_23.a7e4bfb	David Gibson	2023-08-23	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Host routes can include a preferred source address (RTA_PREFSRC), which must be one of the host's addresses. However when using pasta with -a the namespace might be given a different address, not on the host. This seems to occur pretty routinely depending on the network configuration systems in place on the host. With --config-net we will try to copy host routes to the namespace. If one of those includes an RTA_PREFSRC, but the namespace doesn't have the host address, this will fail with -EINVAL, causing pasta to fail. Fix this by stripping off RTA_PREFSRC attributes from routes as we copy them to the namespace. This is by no means infallible, bit it should at least handle common cases for the time being. Link: https://bugs.passt.top/show_bug.cgi?id=71 Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Set IFA_ADDRESS, not just IFA_LOCAL, while adding IPv4 addresses	Stefano Brivio	2023-08-23	1	-0/+1
\| \| \| \| \| \| \| \| \|	Otherwise, we actually configure the address, but it's not usable because no local route is added by the kernel. Link: https://github.com/containers/podman/pull/19699 Fixes: cfe7509e5c16 ("netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove broken pressure calculations for tcp_defer_handler()	David Gibson	2023-08-22	3	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler() performs a potentially expensive linear scan of the connection table. So, to mitigate the cost of that we skip if if we're not under at least moderate pressure: either 30% of available connections or 30% (estimated) of available fds used. But, the calculation for this has been broken since it was introduced: we calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS, meaning we only exit early if conn_count is less than 30% of itself, i.e. never. If that calculation is "corrected" to be based on TCP_MAX_CONNS, it completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my laptop. My guess is that this is because in the case of many short lived connections, we're letting the table become much fuller before compacting it. That means that other places which perform a table scan now have to do much, much more. For the time being, simply remove the tests, since they're not doing anything useful. We can reintroduce them more carefully if we see a need for them. This also removes the only user of c->tcp.splice_conn_count, so that can be removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Add missing double include guard to inany.h	David Gibson	2023-08-22	1	-0/+5
\| \| \| \| \| \| \|	This was overlooked when the file was created. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move in_epoll flag out of common connection structure	David Gibson	2023-08-22	3	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The in_epoll boolean is one of only two fields (currently) in the common structure shared between tap and spliced connections. It seems like it belongs there, because both tap and spliced connections use it, and it has roughly the same meaning. Roughly, however, isn't exactly: which fds this flag says are in the epoll varies between the two connection types, and are in type specific fields. So, it's only possible to meaningfully use this value locally in type specific code anyway. This common field is going to get in the way of more widespread generalisation of connection / flow tracking, so move it to separate fields in the tap and splice specific structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	8	-32/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't include destination address in partially precomputed csums	David Gibson	2023-08-22	3	-52/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We partially prepopulate IP and TCP header structures including, amongst other things the destination address, which for IPv4 is always the known address of the guest/namespace. We partially precompute both the IPv4 header checksum and the TCP checksum based on this. In future we're going to want more flexibility with controlling the destination for IPv4 (as we already do for IPv6), so this precomputed value gets in the way. Therefore remove the IPv4 destination from the precomputed checksum and fold it into the checksum update when we actually send a packet. Doing this means we no longer need to recompute those partial sums when the destination address changes ({tcp,udp}_update_l2_buf()) and instead the computation can be moved to compile time. This means while we perform slightly more computations on each packet, we slightly reduce the amount of memory we need to access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Consistent usage of ports in tcp_seq_init()	David Gibson	2023-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since it's used for connections in both directions. However, these values are just feeding a hash, so as long as we're consistent and include all the information we want, it doesn't really matter. Oddly, for the "src" side we supply the (tap side) forwarding address but the (tap side) endpoint port. This again doesn't really matter, but it's confusing. So swap this with dstport, so "src" is always forwarding and "dst" is always endpoint. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: More precise terms for addresses and ports	David Gibson	2023-08-22	2	-52/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a number of places the comments and variable names we use to describe addresses and ports are ambiguous. It's not sufficient to describe a port as "tap-facing" or "socket-facing", because on both the tap side and the socket side there are two ports for the two ends of the connection. Similarly, "local" and "remote" aren't particularly helpful, because it's not necessarily clear whether we're talking from the point of view of the guest/namespace, the host, or passt itself. This patch makes a number of changes to be more precise about this. It introduces two new terms in aid of this: A "forwarding" address (or port) refers to an address which is local from the point of view of passt itself. That is a source address for traffic sent by passt, whether it's to the guest via the tap interface or to a host on the internet via a socket. The "endpoint" address (or port) is the reverse: a remote address from passt's point of view, the destination address for traffic sent by passt. Between them the "side" (either tap/guest-facing or sock/host-facing) and forwarding vs. endpoint unambiguously describes which address or port we're talking about. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Pass source address to protocol handler functions	David Gibson	2023-08-22	7	-32/+48
\| \| \| \| \| \| \| \| \| \| \|	The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't clobber source address in tap6_handler()	David Gibson	2023-08-22	1	-2/+0
	In tap6_handler() saddr is initialized to the IPv6 source address from the incoming packet. However part way through, but before organizing the packet into a "sequence" we set it unconditionally to the guest's assigned address. We don't do anything equivalent for IPv4. This doesn't make a lot of sense: if the guest is using a different source address it makes sense to consider these different sequences of packets and we shouldn't try to combine them together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>