passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	netlink: Ignore EHOSTUNREACH failures when duplicating routes	Stefano Brivio	2024-06-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To implicitly resolve possible dependencies between routes as we duplicate them into the target namespace, we go through a set of n routes n times, and ignore EEXIST responses to netlink messages (we already inserted the route) and ENETUNREACH (we didn't insert the route yet, but we need to insert another one first). Until now, we didn't ignore EHOSTUNREACH responses. However, NetworkManager users with multiple non-subnet routes for the same interface report that pasta exits with "no route to host" while duplicating routes. This happens because NetworkManager sets the 'noprefixroute' attribute on addresses, meaning that the kernel won't create subnet routes automatically depending on the prefix length of the address. We copy this attribute as we copy the address into the target namespace, and as a result, the kernel doesn't create subnet routes in the target namespace either. This means that the gateway for routes that are inserted later can be unreachable at some points during the sequence of route duplication. That is, we don't just have dependencies between regular routes, but we can also have dependencies between regular routes and subnet routes, as subnet routes are not automatically inserted in advance. Link: https://github.com/containers/podman/issues/22824 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: With no default route, pick the first interface with a route	Stefano Brivio	2024-06-19	2	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While commit f919dc7a4b1c ("conf, netlink: Don't require a default route to start") sounded reasonable in the assumption that, if we don't find default routes for a given address family, we can still proceed by selecting an interface with any route iff it's the only one for that protocol family, Jelle reported a further issue in a similar setup. There, multiple interfaces are present, and while remote container connectivity doesn't matter for the container, local connectivity is desired. There are no default routes, but those multiple interfaces all have non-default routes, so we should just pick one and start. Pick the first interface reported by the kernel with any route, if there are no default routes. There should be no harm in doing so. Reported-by: Jelle van der Waa <jvanderwaa@redhat.com> Reported-by: Martin Pitt <mpitt@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2277954 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	tcp: Don't rely on bind() to fail to decide that connection target is valid	Stefano Brivio	2024-06-19	1	-17/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit e1a2e2780c91 ("tcp: Check if connection is local or low RTT was seen before using large MSS") added a call to bind() before we issue a connect() to the target for an outbound connection. If bind() fails, but neither with EADDRNOTAVAIL, nor with EACCESS, we can conclude that the target address is a local (host) address, and we can use an unlimited MSS. While at it, according to the reasoning of that commit, if bind() succeeds, we would know right away that nobody is listening at that (local) address and port, and we don't even need to call connect(): we can just fail early and reset the connection attempt. But if non-local binds are enabled via net.ipv4.ip_nonlocal_bind or net.ipv6.ip_nonlocal_bind sysctl, binding to a non-local address will actually succeed, so we can't rely on it to fail in general. The visible issue with the existing behaviour is that we would reset any outbound connection to non-local addresses, if non-local binds are enabled. Keep the significant optimisation for local addresses along with the bind() call, but if it succeeds, don't draw any conclusion: close the socket, grab another one, and proceed normally. This will incur a small latency penalty if non-local binds are enabled (we'll likely fetch an existing socket from the pool but additionally call close()), or if the target is local but not bound: we'll need to call connect() and get a failure before relaying that failure back. Link: https://github.com/containers/podman/issues/23003 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	siphash: Remove stale prototypes	David Gibson	2024-06-19	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \|	In fc8f0f8c ("siphash: Use incremental rather than all-at-once siphash functions") we removed the older interface to the SipHash implementation, which took fixed sized blocks of data. However, we forgot to remove the prototypes for those functions, so do that now. Fixes: fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move management of udp[46]_localname into udp_splice_send()	David Gibson	2024-06-14	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Mostly, udp_sock_handler() is independent of how the datagrams it processes will be forwarded (tap or splice). However, it also updates the msg_name fields for spliced sends, which doesn't really make sense here. Move it into udp_splice_send() which is all about spliced sends. This does potentially mean we'll update the field to the same value several times, but we're going to need this in future anyway: with the extensions the flow table allows, it might not be the same value each time after all. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Rework how we divide queued datagrams between sending methods	David Gibson	2024-06-14	1	-61/+87
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() takes a number of datagrams from sockets that depending on their addresses could be forwarded either to the L2 interface ("tap") or to another socket ("spliced"). In the latter case we can also only send packets together if they have the same source port, and therefore are sent via the same socket. To reduce the total number of system calls we gather contiguous batches of datagrams with the same destination interface and socket where applicable. The determination of what the target is is made by udp_mmh_splice_port(). It returns the source port for splice packets and -1 for "tap" packets. We find batches by looking ahead in our queue until we find a datagram whose "splicefrom" port doesn't match the first in our current batch. udp_mmh_splice_port() is moderately expensive, and unfortunately we can call it twice on the same datagram: once as the (last + 1) entry in one batch (to check it's not in that batch), then again as the first entry in the next batch. Avoid this by keeping track of the "splice port" in the metadata structure, and filling it in one entry ahead of the one we're currently considering. This is a bit subtle, but not that hard. It will also generalise better when we have more complex possibilities based on the flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fold checking of splice flag into udp_mmh_splice_port()	David Gibson	2024-06-14	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_mmh_splice_port() is used to determine if a UDP datagram can be "spliced" (forwarded via a socket instead of tap). We only invoke it if the origin socket has the 'splice' flag set. Fold the checking of the flag into the helper itself, which makes the caller simpler. It does mean we have a loop looking for a batch of spliceable or non-spliceable packets even in the case where the flag is clear. This shouldn't be that expensive though, since each call to udp_mmh_splice_port() will return without accessing memory in that case. In any case we're going to need a similar loop in more cases with upcoming flow table work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Split construction of bind socket address from the rest of sock_l4()	David Gibson	2024-06-14	1	-53/+70
\| \| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates, binds and otherwise prepares a new socket. It builds the socket address to bind from separately provided address and port. However, we have use cases coming up where it's more natural to construct the socket address in the caller. Prepare for this by adding sock_l4_sa() which takes a pre-constructed socket address, and rewriting sock_l4() in terms of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: use in->buf_size rather than sizeof(pkt_buf)	Laurent Vivier	2024-06-13	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	buf_size is set to sizeof(pkt_buf) by default. And it seems more correct to provide the actual size of the buffer. Later a buf_size of 0 will allow vhost-user mode to detect guest memory buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	iov: remove iov_copy()	Laurent Vivier	2024-06-13	2	-42/+0
\| \| \| \| \| \| \| \| \|	it was needed by a draft version of vhost-user, it is not needed anymore. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	vhost-user: compare mode MODE_PASTA and not MODE_PASST	Laurent Vivier	2024-06-13	6	-21/+21
\| \| \| \| \| \| \| \| \| \|	As we are going to introduce the MODE_VU that will act like the mode MODE_PASST, compare to MODE_PASTA rather than to add a comparison to MODE_VU when we check for MODE_PASST. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: rename udp_sock_handler() to udp_buf_sock_handler()	Laurent Vivier	2024-06-13	3	-5/+5
\| \| \| \| \| \| \| \| \|	We are going to introduce a variant of the function to use vhost-user buffers rather than passt internal buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: refactor UDP header update functions	Laurent Vivier	2024-06-13	1	-26/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions to improve code portability by replacing the udp_meta_t parameter with more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr) and the source socket address (sockaddr_in/sockaddr_in6). It also moves the tap_hdr_update() function call inside the udp_tap_send() function not to have to pass the TAP header to udp_update_hdr4() and udp_update_hdr6() This refactor reduces complexity by making the functions more modular and ensuring that each function operates on more narrowly scoped data structures. This will facilitate future backend introduction like vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: refactor packets handling functions	Laurent Vivier	2024-06-13	2	-49/+64
\| \| \| \| \| \| \| \| \| \| \| \| \|	Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(), and tap4_handler() and tap6_handler() into tap_handler(). Create a generic tap_add_packet() to consolidate packet addition logic and reduce code duplication. The purpose is to ease the export of these functions to use them with the vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: move buffers management functions to their own file	Laurent Vivier	2024-06-13	5	-550/+654
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move all the TCP parts using internal buffers to tcp_buf.c and keep generic TCP management functions in tcp.c. Add tcp_internal.h to export needed functions from tcp.c and tcp_buf.h from tcp_buf.c With this change we can use existing TCP functions with a different kind of memory storage as for instance the shared memory provided by the guest via vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: extract buffer management from tcp_send_flag()	Laurent Vivier	2024-06-13	1	-24/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit isolates the internal data structure management used for storing data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[], tcp4_flags[], ...) from the tcp_send_flag() function. The extracted functionality is relocated to a new function named tcp_fill_flag_header(). tcp_fill_flag_header() is now a generic function that accepts parameters such as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[]. This separation sets the stage for utilizing tcp_prepare_flags() to set the memory provided by the guest via vhost-user in future developments. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Suppress constParameterCallback errors	David Gibson	2024-06-08	3	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow access to user_devpts2024_06_07.8a83b53	Derek Schrock	2024-06-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow access to user_devpts. $ pasta --version pasta 0^20240510.g7288448-1.fc40.x86_64 ... $ awk '' < /dev/null $ pasta --version $ While this might be a awk bug it appears pasta should still have access to devpts. Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, flow: Fix some error paths which didn't clean up flows properly	David Gibson	2024-06-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flow table entries need to be fully initialised before returning to the main epoll loop. Commit 0060acd1 ("flow: Clarify and enforce flow state transitions") now enforces that: once a flow is allocated we must either cancel it, or activate it before returning to the main loop, or we will hit an ASSERT(). Some error paths in tcp_conn_from_tap() weren't correctly updated for this requirement - we can exit with a flow entry incompletely initialised. Correct that by cancelling the flows in those situations. I don't have enough information to be certain if this is the cause for podman bug 22925, but it plausibly could be. Fixes: 0060acd11b19 ("flow: Clarify and enforce flow state transitions") Link: https://github.com/containers/podman/issues/22925 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use 'long' to represent millisecond durations	David Gibson	2024-06-07	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	timespec_diff_ms() returns an int representing a duration in milliseconds. This will overflow in about 25 days when an int is 32 bits. The way we use this function, we're probably not going to get a result that long, but it's not outrageously implausible. Use a long for safety. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	lineread: Use ssize_t for line lengths	David Gibson	2024-06-07	3	-10/+9
\| \| \| \| \| \| \| \| \|	Functions and structures in lineread.c use plain int to record and report the length of lines we receive. This means we truncate the result from read(2) in some circumstances. Use ssize_t to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Safer parsing of MAC addresses	David Gibson	2024-06-07	1	-17/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In conf() we parse a MAC address in two places, for the --ns-mac-addr and the -M options. As well as duplicating code, the logic for this parsing has several bugs: * The most serious is that if the given string is shorter than a MAC address should be, we'll access past the end of it. * We don't check the endptr supplied by strtol() which means we could ignore certain erroneous contents * We never check the separator characters between each octet * We ignore certain sorts of garbage that follow the MAC address Correct all these bugs in a new parse_mac() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use unsigned indices for bits in bitmaps	David Gibson	2024-06-07	2	-7/+7
\| \| \| \| \| \| \| \| \|	A negative bit index in a bitmap doesn't make sense. Avoid this by construction by using unsigned indices. While we're there adjust bitmap_isset() to return a bool instead of an int. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Enable the bugprone-macro-parentheses check	David Gibson	2024-06-07	4	-26/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together, the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning, because of other conventions its been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove pointless macro parameters in CALL_PROTO_HANDLER	David Gibson	2024-06-07	1	-3/+3
\| \| \| \| \| \| \| \|	The 'c' parameter is always passed exactly 'c'. The 'now' parameter is always passed exactly 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Make rport calculation more local	David Gibson	2024-06-07	1	-2/+1
\| \| \| \| \| \| \| \| \|	cppcheck 2.14.1 complains about the rport variable not being in as small as scope as it could be. It's also only used once, so we might as well just open code the calculation for it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make pointer const in tcp_revert_seq	David Gibson	2024-06-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	The th pointer could be const, which causes a cppcheck warning on at least some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40). Fixes: e84a01e94c9f ("tcp: move seq_to_tap update to when frame is queued") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Remove log_to_stdout option	David Gibson	2024-06-05	2	-6/+3
\| \| \| \| \| \| \| \|	Now that we've simplified how usage() works, nothing ever sets the log_to_stdout flag. Eliminate it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Don't print usage via the logging subsystem	David Gibson	2024-06-05	1	-160/+166
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The message from usage() when given invalid options, or the -h / --help option is currently printed by many calls to the info() function, also used for runtime logging of informational messages. That isn't useful: the usage message should always go to the terminal (stdout or stderr), never syslog or a logfile. It should never be filtered by priority. Really the only thing using the common logging functions does is give more opportunities for something to go wrong. Replace all the info() calls with direct fprintf() calls. This does mean manually adding "\n" to each message. A little messy, but worth it for the simplicity in other dimensions. While we're there make much heavier use of single strings containing multiple lines of output text. That reduces the number of fprintf calls, reducing visual clutter and making it easier to see what the output will look like from the source. Link: https://bugs.passt.top/show_bug.cgi?id=90 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Remove unhelpful usage() wrapper	David Gibson	2024-06-05	1	-13/+4
\| \| \| \| \| \| \| \| \|	usage() does nothing but call print_usage() with EXIT_FAILURE as a parameter. It's no more complex to just give that parameter at the single call site. Eliminate it and rename print_usage() to just usage(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: move seq_to_tap update to when frame is queued	Jon Maloy	2024-06-05	1	-22/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") delayed update of conn->seq_to_tap until the moment the corresponding frame has been successfully pushed out. This has the advantage that we immediately can make a new attempt to transmit a frame after a failed trasnmit, rather than waiting for the peer to later discover a gap and trigger the fast retransmit mechanism to solve the problem. This approach has turned out to cause a problem with spurious sequence number updates during peer-initiated retransmits, and we have realized it may not be the best way to solve the above issue. We now restore the previous method, by updating the said field at the moment a frame is added to the outqueue. To retain the advantage of having a quick re-attempt based on local failure detection, we now scan through the part of the outqueue that had do be dropped, and restore the sequence counter for each affected connection to the most appropriate value. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Fix comments after PID file and AF_UNIX socket creation refactoring2024_05_23.765eb0b	Stefano Brivio	2024-05-23	3	-7/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Now: - we don't open the PID file in main() anymore - PID file and AF_UNIX socket are opened by pidfile_open() and tap_sock_unix_open() - write_pidfile() becomes pidfile_write() Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	conf, passt.h: Rename pid_file in struct ctx to pidfile	Stefano Brivio	2024-05-23	2	-6/+6
\| \| \| \| \| \| \|	We have pidfile_fd now, pid_file_fd would be quite ugly. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	conf, passt, tap: Open socket and PID files before switching UID/GID	Stefano Brivio	2024-05-23	5	-11/+28
\| \| \| \| \| \| \| \| \| \| \| \|	Otherwise, if the user runs us as root, and gives us paths that are only accessible by root, we'll fail to open them, which might in turn encourage users to change permissions or ownerships: definitely a bad idea in terms of security. Reported-by: Minxi Hou <mhou@redhat.com> Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	passt, util: Move opening of PID file to its own function	Stefano Brivio	2024-05-23	3	-9/+25
\| \| \| \| \| \| \|	We won't call it from main() any longer: move it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	util: Rename write_pidfile() to pidfile_write()	Stefano Brivio	2024-05-23	3	-5/+5
\| \| \| \| \| \| \| \|	As I'm adding pidfile_open() in the next patch. The function comment didn't match, by the way. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	tap: Split tap_sock_unix_init() into opening and listening parts	Stefano Brivio	2024-05-23	1	-12/+27
\| \| \| \| \| \| \| \| \|	We'll need to open and bind the socket a while before listening to it, so split that into two different functions. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen	Stefano Brivio	2024-05-23	2	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit 0515adceaa8f ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initalisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Move all-ones initialisation of mac_guest to tap_sock_init()	Stefano Brivio	2024-05-23	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It has nothing to do with tap_sock_unix_init(). It used to be there as that function could be called multiple times per passt instance, but it's not the case anymore. This also takes care of the fact that, with --fd, we wouldn't set the initial MAC address, so we would need to wait for the guest to send us an ARP packet before we could exchange data. Fixes: 6b4e68383c66 ("passt, tap: Add --fd option") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	conf: Don't lecture user about starting us as root	Stefano Brivio	2024-05-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	libguestfs tools have a good reason to run as root: if the guest image is owned by root, it would be counterproductive to encourage users to invoke them as non-root, as it would require changing permissions or ownership of the image file. And if they run as root, we'll start as root, too. Warn users we'll switch to 'nobody', but don't tell them what to do. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	netlink, test: Ignore deprecated addresses	David Gibson	2024-05-22	6	-7/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we retrieve or copy host addresses we can include deprecated addresses, which is not what we want. Adjust our logic to exclude them. Similarly our tests can retrieve deprecated addresses, so exclude them there too. I hit this in practice because my router sometimes temporarily advertises an fd00:: prefix before the real delegated IPv6 prefix. The deprecated address can hang around for some time messing up my tests. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove interim 'tapside' field from connection	David Gibson	2024-05-22	2	-8/+6
\| \| \| \| \| \| \| \| \| \|	We recently introduced this field to keep track of which side of a TCP flow is the guest/tap facing one. Now that we generically record which pif each side of each flow is connected to, we can easily derive that, and no longer need to keep track of it explicitly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Record the pifs for each side of each flow	David Gibson	2024-05-22	7	-16/+109
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Make side 0 always be the initiating side	David Gibson	2024-05-22	7	-27/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Clarify and enforce flow state transitions	David Gibson	2024-05-22	6	-69/+183
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Better helpers for using inany and specific family addrs together	David Gibson	2024-05-22	3	-37/+106
\| \| \| \| \| \| \| \| \|	This adds some extra inany helpers for comparing an inany address to addresses of a specific family (including special addresses), and building an inany from an IPv4 address (either statically or at runtime). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Properly type callbacks to protocol specific handlers	David Gibson	2024-05-22	6	-25/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The flow dispatches deferred and timer handling for flows centrally, but needs to call into protocol specific code for the handling of individual flows. Currently this passes a general union flow *. It makes more sense to pass the specific relevant flow type structure. That brings the check on the flow type adjacent to casting to the union variant which it tags. Arguably, this is a slight abstraction violation since it involves the generic flow code using protocol specific types. It's already calling into protocol specific functions, so I don't think this really makes any difference. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, tcp: Add helper to display socket addresses	David Gibson	2024-05-22	3	-14/+79
\| \| \| \| \| \| \| \| \| \| \|	When reporting errors, we sometimes want to show a relevant socket address. Doing so by extracting the various relevant fields can be pretty awkward, so introduce a sockaddr_ntop() helper to make it simpler. For now we just have one user in tcp.c, but I have further upcoming patches which can make use of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Fix passt abstraction	Maxime Bélair	2024-05-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit b686afa2 introduced the invalid apparmor rule `mount options=(rw, runbindable) /,` since runbindable mount rules cannot have a source. Therefore running aa-logprof/aa-genprof will trigger errors (see https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685) $ sudo aa-logprof ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/') This patch fixes it to the intended behavior. Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685 Fixes: b686afa23e85 ("apparmor: Explicitly pass options we use while remounting root filesystem") Signed-off-by: Maxime Bélair <maxime.belair@canonical.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: allow netns paths on /tmp	Paul Holzinger	2024-05-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For some unknown reason "owner" makes it impossible to open bind mounted netns references as apparmor denies it. In the kernel denied log entry we see ouid=0 but it is not clear why that is as the actual file is owned by the real (rootless) user id. In abstractions/pasta there is already `@{run}/user/@{uid}/**` without owner set for the same reason as this path contains the netns path by default when running under Podman. Fixes: 72884484b00d ("apparmor: allow read access on /tmp for pasta") Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>