passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	netlink: Fix selection of template interface	David Gibson	2024-03-20	2	-26/+40
	Since f919dc7a4b1c ("conf, netlink: Don't require a default route to start"), if there is only one host interface with routes, we will pick that as the template interface, even if there are no default routes for an IP version. Unfortunately this selection had a serious flaw: in some cases it would 'return' in the middle of an nl_foreach() loop, meaning we wouldn't consume all the netlink responses for our query. This could cause later netlink operations to fail as we read leftover responses from the aborted query. Rewrite the interface detection to avoid this problem. While we're there: * Perform detection of both default and non-default routes in a single pass, avoiding an ugly goto * Give more detail on error and working but unusual paths about the situation (no suitable interface, multiple possible candidates, etc.). Fixes: f919dc7a4b1c ("conf, netlink: Don't require a default route to start") Link: https://bugs.passt.top/show_bug.cgi?id=83 Link: https://github.com/containers/podman/issues/22052 Link: https://bugzilla.redhat.com/show_bug.cgi?id=2270257 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Use info(), not warn() for somewhat expected cases where one IP version has no default routes, or no routes at all] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
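The single-pass selection described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `struct route` and `pick_template_ifi()` are hypothetical names, and a plain array stands in for the netlink responses. The point it shows is that the loop never returns early, so every entry is consumed before a decision is made.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one route reported by the kernel */
struct route {
	unsigned ifi;		/* index of the interface the route uses */
	int is_default;		/* non-zero for a default route */
};

/* Return the chosen interface index, or 0 if there is none or the
 * choice is ambiguous. Assumed policy: the first default route wins,
 * otherwise the only interface that has any route at all.
 */
static unsigned pick_template_ifi(const struct route *r, size_t n)
{
	unsigned defifi = 0, anyifi = 0;
	int ambiguous = 0;
	size_t i;

	/* No 'return' inside the loop: all entries are always consumed */
	for (i = 0; i < n; i++) {
		if (!r[i].ifi)
			continue;

		if (r[i].is_default && !defifi)
			defifi = r[i].ifi;

		if (!anyifi)
			anyifi = r[i].ifi;
		else if (anyifi != r[i].ifi)
			ambiguous = 1;
	}

	if (defifi)
		return defifi;
	if (anyifi && !ambiguous)
		return anyifi;

	return 0;
}
```

With two routes on interface 2 and no default route, the sketch picks 2. With routes on interfaces 2 and 3 and no default route, it returns 0, which a caller could report as "multiple possible candidates".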
*	netlink: Fix handling of NLMSG_DONE in nl_route_dup() (tag: 2024_03_19.d35bcbe)	David Gibson	2024-03-19	1	-9/+8
	A recent kernel change 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families") changed netlink behaviour so that the NLMSG_DONE terminating a bunch of responses can go in the same datagram as those responses, rather than in a separate one. Our netlink code is supposed to handle that behaviour, and indeed does so for most cases, using the nl_foreach() macro. However, there was a subtle error in nl_route_dup() which doesn't work with this change. f00b1534 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE") attempted to fix this, but has its own subtle error. The problem arises because nl_route_dup(), unlike other cases doesn't just make a single pass through all the responses to a netlink request. It needs to get all the routes, then make multiple passes through them. We don't really have anywhere to buffer multiple datagrams, so we only support the case where all the routes fit in a single datagram - but we need to fail gracefully when that's not the case. After receiving the first datagram of responses (with nl_next()) we have a first loop scanning them. It needs to exit when either we run out of messages in the datagram (!NLMSG_OK()) or when we get a message indicating the last response (nl_status() <= 0). What we do after the loop depends on which exit case we had. If we saw the last response, we're done, but otherwise we need to receive more datagrams to discard the rest of the responses. We attempt to check for that second case by re-checking NLMSG_OK(nh, status). However in the got-last-response case, we've altered status from the number of remaining bytes to the error code (usually 0). That means NLMSG_OK() now returns false even if it didn't during the loop check. To fix this we need separate variables for the number of bytes left and the final status code. We also checked status after the loop, but this was redundant: we can only exit the loop with NLMSG_OK() == true if status <= 0. Reported-by: Martin Pitt <mpitt@redhat.com> Fixes: f00b153414b1 ("netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE") Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Link: https://github.com/containers/podman/issues/22052 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
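The "separate variables" idea can be illustrated with the standard macros from `<linux/netlink.h>`. The sketch below is not the nl_route_dup() code: `scan_responses()` and `demo_scan()` are hypothetical, only NLMSG_DONE is handled (NLMSG_ERROR is left out for brevity), and the messages are header-only. It keeps the count of unparsed bytes in `left` and reports the final status through a different variable, so the two can never be confused.

```c
#include <stdbool.h>
#include <linux/netlink.h>

/* Walk one datagram of netlink responses.
 * Return true if the terminating NLMSG_DONE was in this datagram, false
 * if we ran out of bytes first (more datagrams would have to be read).
 */
static bool scan_responses(struct nlmsghdr *nh, int len, int *status)
{
	int left = len;	/* bytes still unparsed, never reused as a status */

	for (; NLMSG_OK(nh, left); nh = NLMSG_NEXT(nh, left)) {
		if (nh->nlmsg_type == NLMSG_DONE) {
			*status = 0;	/* final status, kept apart from 'left' */
			return true;
		}
		/* A real caller would process the route message here */
	}

	return false;
}

/* Build two header-only messages, the second one being NLMSG_DONE, and
 * scan only the first 'nmsgs' of them.
 */
static bool demo_scan(int nmsgs)
{
	struct nlmsghdr msgs[2] = {
		{ .nlmsg_len = NLMSG_HDRLEN, .nlmsg_type = NLMSG_MIN_TYPE },
		{ .nlmsg_len = NLMSG_HDRLEN, .nlmsg_type = NLMSG_DONE },
	};
	int status = -1;

	return scan_responses(msgs, nmsgs * NLMSG_HDRLEN, &status);
}
```

Here `demo_scan(2)` models the newer kernel behaviour, with NLMSG_DONE in the same datagram, while `demo_scan(1)` models a datagram that ends before the terminating message.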
*	fedora: Switch license identifier to SPDX (tag: 2024_03_18.615d370)	Dan Čermák	2024-03-18	1	-1/+1
	The spec file patch by Dan Čermák was originally contributed at: https://src.fedoraproject.org/rpms/passt/pull-request/1 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Translate source address of resolver only for DNS remapped queries	Stefano Brivio	2024-03-18	1	-6/+15
	Paul reports that if pasta is configured with --dns-forward, and the container queries a resolver which is configured on the host directly, without using the address given for --dns-forward, we'll translate the source address of the response pretending it's coming from the address passed as --dns-forward, and the client will discard the reply. That is, $ cat /etc/resolv.conf 198.51.100.1 $ pasta --config-net --dns-forward 192.0.2.1 nslookup passt.top will not work, because we change the source address of the reply from 198.51.100.1 to 192.0.2.1. But the client contacted 198.51.100.1, and it's from that address that it expects an answer. Add a PORT_DNS_FWD flag for tap-facing ports, which is triggered by activity in the opposite direction as the other flags. If the tap-facing port was seen sending a DNS query that was remapped, we'll remap the source address of the response, otherwise we'll leave it unaffected. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
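The behaviour described above can be modelled in a few lines. Only the flag name PORT_DNS_FWD comes from the commit message. Its value, the `port_flags` table, both functions and the use of plain host-order integers for addresses are assumptions made for this sketch, and flag expiry is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define PORT_DNS_FWD	(1 << 0)	/* name from the commit, value assumed */

static uint8_t port_flags[65536];	/* hypothetical per-port flag table */

/* DNS query seen from the tap side: return true, and remember it, if the
 * destination was the --dns-forward address and so had to be remapped.
 */
static bool tap_query(uint16_t sport, uint32_t daddr, uint32_t dns_fwd)
{
	if (daddr != dns_fwd)
		return false;

	port_flags[sport] |= PORT_DNS_FWD;
	return true;
}

/* Reply going back to the tap side: pick the source address to present.
 * Only ports that sent a remapped query get the --dns-forward address.
 */
static uint32_t reply_src(uint16_t dport, uint32_t resolver, uint32_t dns_fwd)
{
	return (port_flags[dport] & PORT_DNS_FWD) ? dns_fwd : resolver;
}
```

With 192.0.2.1 (0xc0000201) as the forward address and 198.51.100.1 (0xc6336401) as the host resolver, a client that queried 198.51.100.1 directly gets its reply from 198.51.100.1, which is the case the commit fixes.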
*	conf, netlink: Don't require a default route to start	Stefano Brivio	2024-03-18	3	-21/+59
	There might be isolated testing environments where default routes and global connectivity are not needed, a single interface has all non-loopback addresses and routes, and still passt and pasta are expected to work. In this case, it's pretty obvious what our upstream interface should be, so go ahead and select the only interface with at least one route, disabling DHCP and implying --no-map-gw as the documentation already states. If there are multiple interfaces with routes, though, refuse to start, because at that point it's really not clear what we should do. Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://github.com/containers/podman/issues/21896 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Don't try to get further datagrams in nl_route_dup() on NLMSG_DONE	Stefano Brivio	2024-03-18	1	-1/+2
	Martin reports that, with Fedora Linux kernel version kernel-core-6.9.0-0.rc0.20240313gitb0546776ad3f.4.fc41.x86_64, including commit 87d381973e49 ("genetlink: fit NLMSG_DONE into same read() as families"), pasta doesn't exit once the network namespace is gone. Actually, pasta is completely non-functional, at least with default options, because nl_route_dup(), which duplicates routes from the parent namespace into the target namespace at start-up, is stuck on a second receive operation for RTM_GETROUTE. However, with that commit, the kernel is now able to fit the whole response, including the NLMSG_DONE message, into a single datagram, so no further messages will be received. It turns out that commit 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") accidentally relied on the fact that we would always get at least two datagrams as a response to RTM_GETROUTE. That is, the test to check if we expect another datagram, is based on the 'status' variable, which is 0 if we just parsed NLMSG_DONE, but we'll also expect another datagram if NLMSG_OK on the last message is false. But NLMSG_OK with a zero length is always false. The problem is that we don't distinguish if status is zero because we got a NLMSG_DONE message, or because we processed all the available datagram bytes. Introduce an explicit check on NLMSG_DONE. We should probably refactor this slightly, for example by introducing a special return code from nl_status(), but this is probably the least invasive fix for the issue at hand. Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://github.com/containers/podman/issues/22052 Fixes: 4d6e9d0816e2 ("netlink: Always process all responses to a netlink request") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Rename tap_iov_{base,len}	David Gibson	2024-03-14	3	-17/+17
	These two functions are typically used to calculate values to go into the iov_base and iov_len fields of a struct iovec. They don't have to be used for that, though. Rename them in terms of what they actually do: calculate the base address and total length of the complete frame, including both L2 and tap specific headers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Implement tap_send() "slow path" in terms of fast path	David Gibson	2024-03-14	3	-25/+19
	Most times we send frames to the guest it goes via tap_send_frames(). However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go via tap_send(). As well as being a semantic duplication, tap_send() contains at least one serious problem: it doesn't properly handle short sends, which can be fatal on the qemu socket connection, since frame boundaries will get out of sync. Rewrite tap_send() to call tap_send_frames(). While we're there, rename it tap_send_single() for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
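To see why a short send breaks framing on a stream socket, consider the generic pattern below. It is not how tap_send_frames() is implemented: `send_all()` and `demo_send_all()` are hypothetical helpers that only show the usual remedy, which is to keep sending the unsent tail of a frame before anything else goes out on the socket.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send exactly 'len' bytes on a blocking stream socket, retrying after
 * short sends. Return 0 on success, -1 on error.
 */
static int send_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t n = send(fd, p, len, MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* Short send: advance past what went out, send the rest */
		p += n;
		len -= n;
	}

	return 0;
}

/* Push one five-byte "frame" through a socket pair and read it back.
 * Return 1 if it arrives intact, 0 otherwise.
 */
static int demo_send_all(void)
{
	char in[5];
	int sv[2], ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 0;

	ok = !send_all(sv[0], "frame", 5) &&
	     recv(sv[1], in, sizeof(in), MSG_WAITALL) == 5 &&
	     !memcmp(in, "frame", 5);

	close(sv[0]);
	close(sv[1]);

	return ok;
}
```

If the loop were missing, a partial frame followed by the start of the next one would be read by the peer as a single corrupted frame, which is the out-of-sync condition the commit message mentions.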
*	tap: Simplify some casts in the tap "slow path" functions	David Gibson	2024-03-14	1	-23/+18
	We can both remove some variables which differ from others only in type, and slightly improve type safety. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Extend tap_send_frames() to allow multi-buffer frames	David Gibson	2024-03-14	4	-37/+59
	tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Improve comment to rembufs calculation] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, log: Call __openlog() earlier, log to stderr until we detach	Stefano Brivio	2024-03-14	2	-8/+8
	Paul reports that, with commit 15001b39ef1d ("conf: set the log level much earlier"), early messages aren't reported to standard error anymore. The reason is that, once the log mask is changed from LOG_EARLY, we don't force logging to stderr, and this mechanism was abused to have early errors on stderr. Now that we drop LOG_EARLY earlier on, this doesn't work anymore. Call __openlog() as soon as we know the mode we're running as, using LOG_PERROR. Then, once we detach, if we're not running from an interactive terminal and logging to standard error is not forced, drop LOG_PERROR from the options. While at it, check if the standard error descriptor refers to a terminal, instead of checking standard output: if the user redirects standard output to /dev/null, they might still want to see messages from standard error. Further, make sure we don't print messages to standard error reporting that we couldn't log to the system logger, if we didn't open a connection yet. That's expected. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 15001b39ef1d ("conf: set the log level much earlier") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	pcap: Use clock_gettime() instead of gettimeofday()	Stefano Brivio	2024-03-14	2	-13/+14
	POSIX.1-2008 declared gettimeofday() as obsolete, but I'm a dinosaur. Usually, C libraries translate that to the clock_gettime() system call anyway, but this doesn't happen in Jon's environment, and, there, seccomp happily kills pasta(1) when started with --pcap, because we didn't add gettimeofday() to our seccomp profiles. Use clock_gettime() instead. Reported-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt.1: --{no-,}dhcp-dns and --{no-,}dhcp-search don't take addresses	Stefano Brivio	2024-03-14	1	-4/+4
	...they are simple enable/disable options. Fixes: 89678c515755 ("conf, udp: Introduce basic DNS forwarding") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Warn if we can't advertise any nameserver via DHCP, NDP, or DHCPv6	Stefano Brivio	2024-03-14	1	-2/+15
	We might have read from resolv.conf, or from the command line, a resolver that's reachable via loopback address, but that doesn't mean we can offer that via DHCP, NDP or DHCPv6: warn if there are no resolvers we can offer for a given IP version. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Handle addresses passed via --dns just like the ones from resolv.conf	Stefano Brivio	2024-03-14	1	-6/+6
	...that is, call add_dns4() and add_dns6() instead of simply adding those to the list of servers we advertise. Most importantly, this will set the 'dns_host' field for the matching IP version, so that, as mentioned in the man page, servers passed via --dns are used for DNS mapping as well, if used in combination with --dns-forward. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=82 Fixes: 89678c515755 ("conf, udp: Introduce basic DNS forwarding") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Capture only packets that are actually sent	Laurent Vivier	2024-03-13	1	-1/+1
	In tap_send_frames(), if we failed to send all the frames, we must only log the frames that have been sent, not all the frames we wanted to send. Fixes: dda7945ca9c9 ("pcap: Handle short writes in pcap_frame()") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use existing helper for UDP checksum on inbound IPv6 packets	David Gibson	2024-03-13	1	-4/+1
	Currently we open code the calculation of the UDP checksum in udp_update_hdr6(). We call a helper to handle the IPv6 pseudo-header, and preset the checksum field to 0 so an uninitialised value doesn't get folded in. We already have a helper to do this: csum_udp6() which we use in some slow paths. Use it here as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Avoid unnecessary pointer in udp_update_hdr4()	David Gibson	2024-03-13	1	-9/+9
	We carry around the source address as a pointer to a constant struct in_addr. But it's silly to carry around a 4 or 8 byte pointer to a 4 byte IPv4 address. Just copy the IPv4 address around by value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Re-order udp_update_hdr[46] for clarity and brevity	David Gibson	2024-03-13	1	-26/+14
	The order of things in these functions is a bit odd for historical reasons. We initialise some IP header fields early, the rest later, after making some tests. Likewise we declare some variables without initialisation, but then unconditionally set them to values we could calculate at the start of the function. Previous cleanups have removed the reasons for some of these choices, so reorder for clarity, and where possible move the first assignment into an initialiser. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pass data length explicitly to udp_update_hdr[46]	David Gibson	2024-03-13	1	-12/+16
	These functions take an index to the L2 buffer whose header information to update. They use that for two things: to locate the buffer pointer itself, and to retrieve the length of the received message from the parallel udp[46]_l2_mh_sock array. The latter is arguably a failure to separate concerns. Change these functions to explicitly take a buffer pointer and payload length as parameters. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consistent port variable names in udp_update_hdr[46]	David Gibson	2024-03-13	1	-18/+18
	In these functions we have 'dstport' for the destination port, but 'src_port' for the source port. Change the latter to 'srcport' for consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Refactor udp_sock[46]_iov_init()	David Gibson	2024-03-13	1	-52/+50
	Each of these functions has 3 essentially identical loops in a row. Merge the loops into a single common udp_sock_iov_init() function, calling udp_sock[46]_iov_init_one() helpers to initialize each "slot" in the various parallel arrays. This is slightly neater now, and more naturally allows changes we want to make where more initialization will become common between IPv4 and IPv6. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Don't warn if nameservers were found, but won't be advertised	Stefano Brivio	2024-03-12	1	-3/+8
	Starting from commit 3a2afde87dd1 ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields"), we won't add to c->ip4.dns and c->ip6.dns nameservers that can't be used by the guest or container, and we won't advertise them. However, the fact that we don't advertise any nameserver doesn't mean that we didn't find any, and we should warn only if we couldn't find any. This is particularly relevant in case both --dns-forward and --no-map-gw are passed, and a single loopback address is listed in /etc/resolv.conf: we'll forward queries directed to the address specified by --dns-forward to the loopback address we found, we won't advertise that address, so we shouldn't warn: this is a perfectly legitimate usage. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/issues/19213 Fixes: 3a2afde87dd1 ("conf, udp: Drop mostly duplicated dns_send arrays, rename related fields") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: Paul Holzinger <pholzing@redhat.com>
*	icmp: Use 'flowside' epoll references for ping sockets	David Gibson	2024-03-12	5	-44/+24
	Currently ping sockets use a custom epoll reference type which includes the ICMP id. However, now that we have entries in the flow table for ping flows, finding that is sufficient to get everything else we want, including the id. Therefore remove the icmp_epoll_ref type and use the general 'flowside' field for ping sockets. Having done this we no longer need separate EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 reference types, because we can easily determine which case we have from the flow type. Merge both types into EPOLL_TYPE_PING. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Flow based error reporting	David Gibson	2024-03-12	1	-14/+12
	Use flow_dbg() and flow_err() helpers to generate flow-linked error messages in most places. Make a few small improvements to the messages while we're at it. This allows us to avoid the awkward 'pname' variables since whether we're dealing with ICMP or ICMPv6 is already built into the flow type which these helpers include. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Coding style fix in icmp_tap_handler()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Store ping socket information in flow table	David Gibson	2024-03-12	8	-88/+116
	Currently icmp_id_map[][] stores information about ping sockets in a bespoke structure. Move the same information into new types of flow in the flow table. To match that change, replace the existing ICMP timer with a flow-based timer for expiring ping sockets. This has the advantage that we only need to scan the active flows, not all possible ids. We convert icmp_id_map[][] to point to the flow table entries, rather than containing its own information. We do still use that array for locating the right ping flows, rather than using a "flow native" form of lookup for the time being. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Update id_sock description in comment to icmp_ping_new()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ip: Use regular htons() for non-constant protocol number in L2_BUF_IP4_PSUM	Stefano Brivio	2024-03-08	1	-1/+1
	instead of htons_constant(), which is for... constants. Fixes: 5bf200ae8a1a ("tcp, udp: Don't include destination address in partially precomputed csums") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
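The distinction between a constant-only byte swap and the regular htons() can be illustrated as below. HTONS_CONST() is a hypothetical stand-in written for this sketch, not passt's htons_constant(), and `proto_be()` is likewise an invented helper. The macro relies on the `__BYTE_ORDER__` predefined macros offered by GCC and Clang.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical constant-only byte swap: it is a plain constant
 * expression, so it can be used in static initialisers
 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define HTONS_CONST(x)	((uint16_t)((((x) & 0xff) << 8) | (((x) >> 8) & 0xff)))
#else
#define HTONS_CONST(x)	((uint16_t)(x))
#endif

/* Compile-time constant: the macro form is appropriate here */
static const uint16_t ethertype_ip4 = HTONS_CONST(0x0800);

/* Value only known at run time: use the regular htons() */
static uint16_t proto_be(uint8_t protocol)
{
	return htons(protocol);
}
```

Both forms produce the same bytes. The difference is only in where each is appropriate: initialisers and other constant contexts for the macro, values known at run time for htons().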
*	iov: Improve documentation of iov_skip_bytes()	David Gibson	2024-03-07	2	-15/+15
	As pointed out in review, the documentation comments for iov_skip_bytes() are more confusing than they should be. Reword them, including updating parameter names, to make it clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6()	Laurent Vivier	2024-03-06	1	-50/+106
	Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function tcp_fill_header(). Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to tcp_fill_headers4() and tcp_fill_headers6() Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-10-lvivier@redhat.com> [dwg: Correct commit message with new function names] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: make tap_update_mac() generic	Laurent Vivier	2024-03-06	4	-12/+12
	Use ethhdr rather than tap_hdr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-9-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: introduce functions to compute the header part checksum for TCP/UDP	Laurent Vivier	2024-03-06	4	-50/+90
	The TCP and UDP checksums are computed using the data in the TCP/UDP payload but also some information in the IP header (protocol, length, source and destination addresses). We add two functions, proto_ipv4_header_psum() and proto_ipv6_header_psum(), to compute the checksum of the IP header part. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-8-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
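For IPv4, the "IP header part" of a TCP or UDP checksum is the pseudo-header: source address, destination address, a zero byte, the protocol number and the L4 length. A minimal sketch of summing it is shown below. `ip4_pseudo_sum()` and `csum_fold()` are hypothetical names for this illustration and do not reproduce the signature of proto_ipv4_header_psum().

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Unfolded one's complement sum of the IPv4 pseudo-header, taken over
 * 16-bit words in memory (network) order. @l4len is in host order.
 */
static uint32_t ip4_pseudo_sum(uint8_t protocol, uint16_t l4len,
			       struct in_addr saddr, struct in_addr daddr)
{
	uint32_t sum = htons(protocol);	/* zero byte, then protocol */

	sum += htons(l4len);
	sum += saddr.s_addr & 0xffff;	/* two 16-bit halves per address */
	sum += saddr.s_addr >> 16;
	sum += daddr.s_addr & 0xffff;
	sum += daddr.s_addr >> 16;

	return sum;
}

/* Fold a 32-bit partial sum into 16 bits, adding the carries back in */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

The unfolded value would then be used as the starting sum for the TCP or UDP header and payload, with the final one's complement taken only once at the end.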
*	checksum: use csum_ip4_header() in udp.c and tcp.c	Laurent Vivier	2024-03-06	5	-46/+27
	We can find the same function to compute the IPv4 header checksum in tcp.c, udp.c and tap.c. Use the function defined for tap.c, csum_ip4_header(), but with the code used in tcp.c and udp.c as it doesn't need a fully initialized IPv4 header, only protocol, tot_len, saddr and daddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-7-lvivier@redhat.com> [dwg: Fix weird cppcheck regression; it appears to be a problem in pre-existing code, but somehow this patch is exposing it] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: little cleanup in udp_update_hdrX() to prepare future changes	Laurent Vivier	2024-03-06	1	-20/+19
	in udp_update_hdr4(): Assign the source address to src, either b->s_in.sin_addr, c->ip4.dns_match or c->ip4.gw and then set b->iph.saddr to src->s_addr. in udp_update_hdr6(): Assign the source address to src, either b->s_in6.sin6_addr, c->ip6.dns_match, c->ip6.gw or c->ip6.addr_ll. Assign the destination to dst, either c->ip6.addr_seen or &c->ip6.addr_ll_seen. Then set dst to b->ip6h.daddr and src to b->ip6h.saddr. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-ID: <20240303135114.1023026-6-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	17	-136/+175
	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: add csum_iov()	Laurent Vivier	2024-03-06	2	-14/+49
	Introduce the function csum_unfolded() that computes the unfolded 32-bit checksum of a data buffer, and call it from csum() that returns the folded value. Introduce csum_iov() that computes the checksum using csum_unfolded() on all vectors of the iovec array and returns the folded result. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-4-lvivier@redhat.com> [dwg: Fixed trivial cppcheck & clang-tidy regressions] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
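The reason for keeping an unfolded sum is that partial sums can be chained across buffers and folded once at the end. The sketch below shows that pattern with hypothetical functions (`sum_unfolded()`, `csum_iov_sketch()`, `demo_split_matches_whole()`). It is a plain C loop rather than the project's optimised code, it assumes every vector except the last has an even length, and it assumes the total is small enough (under 128 KiB) that a 32-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Add the 16-bit words of a buffer, in memory order, to a running
 * unfolded sum. A trailing odd byte is padded with a zero byte.
 */
static uint32_t sum_unfolded(const void *buf, size_t len, uint32_t init)
{
	const uint8_t *p = buf;
	uint32_t sum = init;
	uint16_t w;

	for (; len > 1; p += 2, len -= 2) {
		memcpy(&w, p, 2);	/* avoids unaligned access */
		sum += w;
	}

	if (len) {
		w = 0;
		memcpy(&w, p, 1);
		sum += w;
	}

	return sum;
}

/* Checksum over an I/O vector: chain the unfolded sums, fold once */
static uint16_t csum_iov_sketch(const struct iovec *iov, size_t n,
				uint32_t init)
{
	uint32_t sum = init;
	size_t i;

	for (i = 0; i < n; i++)
		sum = sum_unfolded(iov[i].iov_base, iov[i].iov_len, sum);

	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* The same six bytes, as one vector and split in two, must agree */
static int demo_split_matches_whole(void)
{
	uint8_t data[6] = { 0x45, 0x00, 0x00, 0x1c, 0xab, 0xcd };
	struct iovec whole = { data, sizeof(data) };
	struct iovec split[2] = { { data, 2 }, { data + 2, 4 } };

	return csum_iov_sketch(&whole, 1, 0) == csum_iov_sketch(split, 2, 0);
}
```

Folding each vector separately and adding the folded values would also need a further fold, so carrying the 32-bit sum between vectors is the simpler arrangement.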
*	checksum: align buffers	Laurent Vivier	2024-03-06	1	-23/+24
	If the buffer is not aligned, use sum_16b() only on the unaligned part, then use csum_avx2() on the remaining part. Remove the now unneeded function csum_unaligned(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-3-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: add pcap_iov()	Laurent Vivier	2024-03-06	3	-5/+24
	Introduce a new function pcap_iov() to capture a packet described by an IO vector. Update pcap_frame() to manage iovcnt > 1. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-2-lvivier@redhat.com> [dwg: Fixed trivial cppcheck regressions] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	fwd: Rename port_fwd.[ch] and their contents	David Gibson	2024-02-29	9	-53/+53
	Currently port_fwd.[ch] contains helpers related to port forwarding, particularly automatic port forwarding. We're planning to allow much more flexible sorts of forwarding, including both port translation and NAT based on the flow table. This will subsume the existing port forwarding logic, so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names within. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Fix copypasta error in port_fwd_scan_udp() comments	David Gibson	2024-02-29	1	-1/+1
	port_fwd_scan_udp() handles UDP, as the name suggests, but its function comment has the wrong function name and references TCP, due to a bad copy-paste from port_fwd_scan_tcp(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Disallow loopback addresses on tap interface	David Gibson	2024-02-29	1	-0/+19
	The "tap" interface, whether it's actually a tuntap device or a qemu socket, presents a virtual external link between different network hosts. Hence, loopback addresses make no sense there. However, nothing prevents the guest from putting bogus packets with loopback addresses onto the interface and it's not entirely clear what effect that will have on passt. Explicitly test for such packets and drop them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
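A loopback test of the kind described here could look like the sketch below. The function names are hypothetical, and checking both source and destination is an assumption made for illustration, since the commit message does not say which fields are tested.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* True for any address in 127.0.0.0/8 */
static bool addr4_is_loopback(struct in_addr a)
{
	return (ntohl(a.s_addr) >> 24) == 127;
}

/* Should an IPv4 frame seen on the tap side be dropped? */
static bool tap_frame_drop4(struct in_addr src, struct in_addr dst)
{
	return addr4_is_loopback(src) || addr4_is_loopback(dst);
}

/* Same check for IPv6, where the only loopback address is ::1 */
static bool tap_frame_drop6(const struct in6_addr *src,
			    const struct in6_addr *dst)
{
	return IN6_IS_ADDR_LOOPBACK(src) || IN6_IS_ADDR_LOOPBACK(dst);
}
```

Note that the IPv4 check covers the whole 127.0.0.0/8 block, not just 127.0.0.1.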
*	tcp: Validate TCP endpoint addresses	David Gibson	2024-02-29	1	-7/+67
	TCP connections should typically not have wildcard addresses (0.0.0.0 or ::) nor a zero port number for either endpoint. It's not entirely clear (at least to me) if it's strictly against the RFCs to do so, but at any rate the socket interfaces often treat those values specially[1], so it's not really possible to manipulate such connections. Likewise they should not have broadcast or multicast addresses for either endpoint. However, nothing prevents a guest from creating a SYN packet with such values, and it's not entirely clear what the effect on passt would be. To ensure sane behaviour, explicitly check for this case and drop such packets, logging a debug warning (we don't want a higher level, because that would allow a guest to spam the logs). We never expect such an address on an accept()ed socket either, but just in case, check for it as well. [1] Depending on context as "unknown", "match any" or "kernel, pick something for me" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
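The endpoint rules listed in this message translate into a short predicate. The IPv4-only sketch below uses a hypothetical function name and takes the port in host byte order. It checks only the limited broadcast address 255.255.255.255, as subnet-directed broadcasts would need the netmask.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Is this IPv4 address and port usable as a TCP endpoint? */
static bool tcp4_endpoint_ok(struct in_addr addr, uint16_t port)
{
	uint32_t a = ntohl(addr.s_addr);

	if (!port)			/* port 0 means "pick one for me" */
		return false;
	if (a == INADDR_ANY)		/* 0.0.0.0, wildcard */
		return false;
	if (a == INADDR_BROADCAST)	/* 255.255.255.255 */
		return false;
	if (IN_MULTICAST(a))		/* 224.0.0.0/4 */
		return false;

	return true;
}
```

A caller would run this on both endpoints of an incoming SYN and drop the packet, with a debug-level message at most, if either check fails.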
*	tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()	David Gibson	2024-02-29	3	-13/+16
	tcp_listen_handler() uses the epoll reference for the listening socket it handles, and also passes on one variant of it to tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter two functions only need a couple of specific fields from the reference. Pass those specific values instead of the whole reference, which localises the handling of the listening (as opposed to accepted) socket and its reference entirely within tcp_listen_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Improve logic deciding when to splice	David Gibson	2024-02-29	3	-17/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes several tweaks to improve the logic which decides whether we're able to use the splice method for a new connection. * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we check for pasta mode within it, better localising the checks. * Previously if we got a connection from a non-loopback address we'd always fall back to the "tap" path, even if the connection was on a socket in the namespace. If we did get a non-loopback address on a namespace socket, something has gone wrong and the "tap" path certainly won't be able to handle it. Report the error and close, rather than passing it along to tap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Improve error reporting on connect path	David Gibson	2024-02-29	1	-2/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes a number of changes to improve error reporting while connecting a new spliced socket: * We use flow_err() and similar functions so all messages include info on which specific flow was affected * We use strerror() to interpret raw error values * We now report errors on connection (at "trace" level, since this would allow spamming the logs) * We also look up and report some details on EPOLLERR events, which can include connection errors, since we use a non-blocking connect(). Again we use "trace" level since this can spam the logs. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
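As an illustration of how details on an EPOLLERR event can be looked up after a non-blocking connect(), the sketch below reads the pending error with getsockopt(SO_ERROR) and formats it with strerror(). The helper names sock_pending_error() and sock_error_describe() are hypothetical; the actual commit reports through flow_err() and related functions, which are not reproduced here.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fetch (and clear) the pending error on socket @s.
 * Returns 0 if no error is pending, otherwise a positive errno value.
 */
static int sock_pending_error(int s)
{
	int err = 0;
	socklen_t sl = sizeof(err);

	if (getsockopt(s, SOL_SOCKET, SO_ERROR, &err, &sl) < 0)
		return errno;	/* the lookup itself failed */

	return err;
}

/* Hypothetical helper: describe the pending error on @s in @buf */
static void sock_error_describe(int s, char *buf, size_t len)
{
	int err = sock_pending_error(s);

	if (err)
		snprintf(buf, len, "socket %i: %s", s, strerror(err));
	else
		snprintf(buf, len, "socket %i: no pending error", s);
}
```

The resulting string would then be passed to whatever logging call is appropriate, at a level that a peer cannot use to flood the logs.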
*	tcp_splice: Make tcp_splice_connect() create its own sockets	David Gibson	2024-02-29	1	-14/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently creating the connected socket for a splice is split between tcp_splice_conn_from_sock(), which opens the socket, and tcp_splice_connect() which connects it. Alter tcp_splice_connect() to open its own socket based on an address family and pif we pass it. This does require a second conditional on pif, but makes for a more logical split of functionality: tcp_splice_conn_from_sock() picks the target, tcp_splice_connect() creates the connection. While we're there, improve reporting of errors. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Merge tcp_splice_new() into its caller	David Gibson	2024-02-29	1	-34/+24
\| \| \| \| \| \| \| \| \| \|	The only caller of tcp_splice_new() is tcp_splice_conn_from_sock(). Both are quite short, and the division of responsibilities between the two isn't particularly obvious. Simplify by merging the former into the latter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: More specific variable names in new splice path	David Gibson	2024-02-29	2	-20/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_splice_conn_from_sock(), the 'port' variable stores the source port of the connection on the originating side. In tcp_splice_new(), called directly from it, the 'port' parameter gives the _destination_ port of the originating connection and is then updated to the destination port of the connection on the other side. Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the accepted socket (on side 0), whereas in tcp_splice_new(), 's' is the fd of the connecting socket (side 1). I, for one, find having the same variable name with different meanings in such close proximity in the flow of control pretty confusing. Alter the names for greater specificity and clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify flow entry life cycle, introduce uniform logging	David Gibson	2024-02-29	5	-18/+95
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our allocation scheme for flow entries means there are some non-obvious constraints on when what things can be done with an entry. Add a big doc comment explaining the life cycle. In addition, make a FLOW_START() macro to mark one of the important transitions. This encourages correct usage, by making it natural to only access the flow type specific structure after calling it. It also logs that a new flow has been created, which is useful for debugging. We also add logging when a flow's lifecycle ends. This doesn't need a new helper, because it can only happen either from flow_alloc_cancel() or from the flow deferred handler. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Don't use flow_trace() before setting flow type	David Gibson	2024-02-29	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_splice_conn_from_sock() we can call flow_trace() if there's an error setting TCP_QUICKACK. However, we do so before we've set the flow type in the flow entry. That means that flow_trace() will print nonsense when it tries to print the flow type. There's no reason the setsockopt() has to happen before initialising the flow entry, so just move it after. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Simplify clean up logic	David Gibson	2024-02-29	1	-15/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tcp_splice_flow_defer() contains specific logic to determine if we're far enough initialised that we need to close pipes and/or sockets. This is potentially fragile if we change something about the order in which we do things. We can simplify this by initialising the pipe and socket fields to -1 very early, then close()ing them if and only if they're non-negative. This lets us remove a special case cleanup if our connect() fails. This will already trigger a CLOSING event, and the socket fd in question is populated in the connection structure. Thus we can let the new cleanup logic handle it rather than requiring an explicit close(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
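The clean up pattern described above can be sketched as follows. The structure and function names (struct splice_fds, splice_fds_init(), fd_close_if_open(), splice_fds_cleanup()) are hypothetical and only illustrate the idea of initialising descriptors to -1 and closing them if and only if they are non-negative; they are not the fields or functions used in passt.

```c
#include <unistd.h>

/* Hypothetical per-connection descriptors: one socket and one pipe per side */
struct splice_fds {
	int s[2];
	int pipe[2][2];
};

/* Mark every descriptor as "not open" before anything else can fail */
static void splice_fds_init(struct splice_fds *c)
{
	int side;

	for (side = 0; side < 2; side++) {
		c->s[side] = -1;
		c->pipe[side][0] = c->pipe[side][1] = -1;
	}
}

/* Close @fd if and only if it's non-negative, then mark it closed again */
static void fd_close_if_open(int *fd)
{
	if (*fd >= 0) {
		close(*fd);
		*fd = -1;
	}
}

/* Clean up regardless of how far initialisation got */
static void splice_fds_cleanup(struct splice_fds *c)
{
	int side;

	for (side = 0; side < 2; side++) {
		fd_close_if_open(&c->pipe[side][0]);
		fd_close_if_open(&c->pipe[side][1]);
		fd_close_if_open(&c->s[side]);
	}
}
```

Because every field starts at -1, the clean up path needs no knowledge of which initialisation steps completed, and calling it twice is harmless.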