passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	fedora: Replace pasta hard links by separate builds	Stefano Brivio	2023-09-07	1	-6/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The hard link trick didn't actually fix the issue with SELinux file contexts properly: as opposed to symbolic links, SELinux now correctly associates types to the labels that are set -- except that those labels are now shared, so we can end up (depending on how rpm(8) extracts the archives) with /usr/bin/passt having a pasta_exec_t context. This got rather confusing as running restorecon(8) seemed to fix up labels -- but that's simply toggling between passt_exec_t and pasta_exec_t for both links, because each invocation will just "fix" the file with the mismatching context. Replace the hard links with two separate builds of the binary, as suggested by David. The build is reproducible, so we pass "-pasta" in the VERSION for pasta's build. This is wasteful but better than the alternative. Just copying the binary over would otherwise cause issues with debuginfo packages due to duplicate Build-IDs -- and rpmbuild(8) also warns about them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Add pasta's own profile	Stefano Brivio	2023-09-07	3	-10/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If pasta and pasta.avx2 are hard links to passt and passt.avx2, AppArmor will attach their own profiles on execution, and we can restrict passt's profile to what it actually needs. Note that pasta needs to access all the resources that passt needs, so the pasta abstraction still includes passt's one. I plan to push the adaptation required for the Debian package in commit 5bb812e79143 ("debian/rules: Override pasta symbolic links with hard links"), on Salsa. If other distributions need to support AppArmor profiles they can follow a similar approach. The profile itself will be installed, there, via dh_apparmor, in a separate commit, b52557fedcb1 ("debian/rules: Install new pasta profile using dh_apparmor"). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Allow pasta to remount /proc, access entries under its own copy	Stefano Brivio	2023-09-07	1	-0/+7
\| \| \| \| \| \| \| \| \| \|	Since commit b0e450aa8500 ("pasta: Detach mount namespace, (re)mount procfs before spawning command"), we need to explicitly permit mount of /proc, and access to entries under /proc/PID/net (after remount, that's what AppArmor sees as path). Fixes: b0e450aa8500 ("pasta: Detach mount namespace, (re)mount procfs before spawning command") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Allow read-only access to uid_map	Stefano Brivio	2023-09-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Starting with commit 770d1a4502dd ("isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init"), the lack of this rule became more apparent as pasta needs to access uid_map in procfs even as non-root. However, both passt and pasta needs this, in case they are started as root, so add this directly to passt's abstraction (which is sourced by pasta's profile too). Fixes: 770d1a4502dd ("isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Explicitly pass options we use while remounting root filesystem	Stefano Brivio	2023-09-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As a result of AppArmor commit d4b0fef10a4a ("parser: fix rule flag generation change_mount type rules"), we can't expect anymore to get permission to mount() / read-write, with MS_REC \| MS_UNBINDABLE ("runbindable", in AppArmor terms), if we don't explicitly pass those flags as options. It used to work by mistake. Now, the reasonable expectation would be that we could just change the existing rule into: mount options=(rw, runbindable) "" -> /, ...but this now fails to load too, I think as a result of AppArmor commit 9d3f8c6cc05d ("parser: fix parsing of source as mount point for propagation type flags"). It works with 'rw' alone, but 'runbindable' is indeed a propagation type flag. Skip the source specification, it doesn't add anything meaningful to the rule anyway. Reported-by: Paul Holzinger <pholzing@redhat.com> Link: https://github.com/containers/podman/pull/19751 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	apparmor: Use abstractions/nameservice to deal with symlinked resolv.conf	Stefano Brivio	2023-09-06	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	While abstractions/nameservice appeared too broad and overkill for our simple need (read-only resolv.conf access), it properly deals with symlinked resolv.conf files generated by systemd-resolved, NetworkManager or suchlike. If we just grant read-only access to /etc/resolv.conf, we'll fail to read nameserver information in rather common configurations, because AppArmor won't follow the symlink. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Strip RTA_PREFSRC when copying routes to the namespace2023_08_23.a7e4bfb	David Gibson	2023-08-23	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Host routes can include a preferred source address (RTA_PREFSRC), which must be one of the host's addresses. However when using pasta with -a the namespace might be given a different address, not on the host. This seems to occur pretty routinely depending on the network configuration systems in place on the host. With --config-net we will try to copy host routes to the namespace. If one of those includes an RTA_PREFSRC, but the namespace doesn't have the host address, this will fail with -EINVAL, causing pasta to fail. Fix this by stripping off RTA_PREFSRC attributes from routes as we copy them to the namespace. This is by no means infallible, bit it should at least handle common cases for the time being. Link: https://bugs.passt.top/show_bug.cgi?id=71 Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Set IFA_ADDRESS, not just IFA_LOCAL, while adding IPv4 addresses	Stefano Brivio	2023-08-23	1	-0/+1
\| \| \| \| \| \| \| \| \|	Otherwise, we actually configure the address, but it's not usable because no local route is added by the kernel. Link: https://github.com/containers/podman/pull/19699 Fixes: cfe7509e5c16 ("netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove broken pressure calculations for tcp_defer_handler()	David Gibson	2023-08-22	3	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_defer_handler() performs a potentially expensive linear scan of the connection table. So, to mitigate the cost of that we skip if if we're not under at least moderate pressure: either 30% of available connections or 30% (estimated) of available fds used. But, the calculation for this has been broken since it was introduced: we calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS, meaning we only exit early if conn_count is less than 30% of itself, i.e. never. If that calculation is "corrected" to be based on TCP_MAX_CONNS, it completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my laptop. My guess is that this is because in the case of many short lived connections, we're letting the table become much fuller before compacting it. That means that other places which perform a table scan now have to do much, much more. For the time being, simply remove the tests, since they're not doing anything useful. We can reintroduce them more carefully if we see a need for them. This also removes the only user of c->tcp.splice_conn_count, so that can be removed as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	inany: Add missing double include guard to inany.h	David Gibson	2023-08-22	1	-0/+5
\| \| \| \| \| \| \|	This was overlooked when the file was created. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move in_epoll flag out of common connection structure	David Gibson	2023-08-22	3	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The in_epoll boolean is one of only two fields (currently) in the common structure shared between tap and spliced connections. It seems like it belongs there, because both tap and spliced connections use it, and it has roughly the same meaning. Roughly, however, isn't exactly: which fds this flag says are in the epoll varies between the two connection types, and are in type specific fields. So, it's only possible to meaningfully use this value locally in type specific code anyway. This common field is going to get in the way of more widespread generalisation of connection / flow tracking, so move it to separate fields in the tap and splice specific structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't pre-fill IPv4 destination address in headers	David Gibson	2023-08-22	8	-32/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Don't include destination address in partially precomputed csums	David Gibson	2023-08-22	3	-52/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We partially prepopulate IP and TCP header structures including, amongst other things the destination address, which for IPv4 is always the known address of the guest/namespace. We partially precompute both the IPv4 header checksum and the TCP checksum based on this. In future we're going to want more flexibility with controlling the destination for IPv4 (as we already do for IPv6), so this precomputed value gets in the way. Therefore remove the IPv4 destination from the precomputed checksum and fold it into the checksum update when we actually send a packet. Doing this means we no longer need to recompute those partial sums when the destination address changes ({tcp,udp}_update_l2_buf()) and instead the computation can be moved to compile time. This means while we perform slightly more computations on each packet, we slightly reduce the amount of memory we need to access. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Consistent usage of ports in tcp_seq_init()	David Gibson	2023-08-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since it's used for connections in both directions. However, these values are just feeding a hash, so as long as we're consistent and include all the information we want, it doesn't really matter. Oddly, for the "src" side we supply the (tap side) forwarding address but the (tap side) endpoint port. This again doesn't really matter, but it's confusing. So swap this with dstport, so "src" is always forwarding and "dst" is always endpoint. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: More precise terms for addresses and ports	David Gibson	2023-08-22	2	-52/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a number of places the comments and variable names we use to describe addresses and ports are ambiguous. It's not sufficient to describe a port as "tap-facing" or "socket-facing", because on both the tap side and the socket side there are two ports for the two ends of the connection. Similarly, "local" and "remote" aren't particularly helpful, because it's not necessarily clear whether we're talking from the point of view of the guest/namespace, the host, or passt itself. This patch makes a number of changes to be more precise about this. It introduces two new terms in aid of this: A "forwarding" address (or port) refers to an address which is local from the point of view of passt itself. That is a source address for traffic sent by passt, whether it's to the guest via the tap interface or to a host on the internet via a socket. The "endpoint" address (or port) is the reverse: a remote address from passt's point of view, the destination address for traffic sent by passt. Between them the "side" (either tap/guest-facing or sock/host-facing) and forwarding vs. endpoint unambiguously describes which address or port we're talking about. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Pass source address to protocol handler functions	David Gibson	2023-08-22	7	-32/+48
\| \| \| \| \| \| \| \| \| \| \|	The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Don't clobber source address in tap6_handler()	David Gibson	2023-08-22	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tap6_handler() saddr is initialized to the IPv6 source address from the incoming packet. However part way through, but before organizing the packet into a "sequence" we set it unconditionally to the guest's assigned address. We don't do anything equivalent for IPv4. This doesn't make a lot of sense: if the guest is using a different source address it makes sense to consider these different sequences of packets and we shouldn't try to combine them together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Fix domain transitions for typical commands pasta might run2023_08_18.0af928e	Stefano Brivio	2023-08-18	1	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...now it gets ugly. If we use pasta without an existing target namespace, and run commands directly or spawn a shell, and keep the pasta_t domain when we do, they won't be able to do much: a shell might even start, but it's not going to be usable, or to even display a prompt. Ideally, pasta should behave like a shell when it spawns a command: start as unconfined_t and automatically transition to whatever domain is associated in the specific policy for that command. But we can't run as unconfined_t, of course. It would seem natural to switch to unconfined_t "just before", so that the default transitions happen. But transitions can only happen when we execvp(), and that's one single transition -- not two. That is, this approach would work for: pasta -- sh -c 'ip address show' but not for: pasta -- ip address show If we configure a transition to unconfined_t when we run ip(8), we'll really try to start that as unconfined_t -- but unconfined_t isn't allowed as entrypoint for ip(8) itself, and execvp() will fail. However, there aren't many different types of binaries pasta might commonly run -- for example, we're unlikely to see pasta used to run a mount(8) command. Explicitly set up domain transition for common stuff -- switching to unconfined_t for bin_t and shells works just fine, ip(8), ping(8), arping(8) and similar need a different treatment. While at it, allow commands we spawn to inherit resource limits and signal masks, because that's what happens by default, and don't require AT_SECURE sanitisation of the environment (because that won't happen by default). Slightly unrelated: we also need to explicitly allow pasta_t to use TTYs, not just PTYs, otherwise we can't keep stdin and stdout open for shells. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow pasta_t to read nsfs entries	Stefano Brivio	2023-08-18	1	-0/+2
\| \| \| \| \| \| \| \| \|	This is needed to monitor filesystem-bound namespaces and quit when they're gone -- this feature never really worked with SELinux. Fixes: 745a9ba4284c ("pasta: By default, quit if filesystem-bound net namespace goes away") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	selinux: Add rules for sysctl and /proc/net accesses	Stefano Brivio	2023-08-18	2	-0/+4
\| \| \| \| \| \| \| \| \|	That's what we actually need to check networking-related sysctls, to scan for bound ports, and to manipulate bits of network configuration inside pasta's target namespaces. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	selinux: Update policy to fix user/group settings	Stefano Brivio	2023-08-18	2	-4/+13
\| \| \| \| \| \| \| \| \|	Somehow most of this used to work on older kernels, but now we need to explicitly permit setuid, setgid, and setcap capabilities, as well as read-only access to passwd (as we support running under a given login name) and sssd library facilities. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Fix user namespace creation after breaking kernel change	Stefano Brivio	2023-08-18	2	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Kernel commit ed5d44d42c95 ("selinux: Implement userns_create hook") seems to just introduce a new functionality, but given that SELinux implements a form of mandatory access control, introducing the new permission breaks any application (shipping with SELinux policies) that needs to create user namespaces, such as passt and pasta for sandboxing purposes. Add the new 'allow' rules. They appear to be backward compatible, kernel-wise, and the policy now requires the new 'user_namespace' class to build, but that's something distributions already ship. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	selinux: Use explicit paths for binaries in file context	Stefano Brivio	2023-08-18	2	-2/+4
\| \| \| \| \| \| \| \| \|	There's no reason to use wildcards, and we don't want any similarly-named binary (not that I'm aware of any) to risk being associated to passt_exec_t and pasta_exec_t by accident. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	fedora: Install pasta as hard link to ensure SELinux file context match	Stefano Brivio	2023-08-18	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Makefile installs symbolic links by default, which actually worked at some point (not by design) with SELinux, but at least on recent kernel versions it doesn't anymore: override pasta (and pasta.avx2) with hard links. Otherwise, even if the links are labeled as pasta_exec_t, SELinux will "resolve" them to passt_exec_t, and we'll have pasta running as passt_t instead of pasta_t. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>
*	tap: Fix format specifier in tap4_is_fragment() warning	Stefano Brivio	2023-08-16	1	-1/+2
\| \| \| \| \| \| \| \|	Spotted by Coverity, relatively harmless. Fixes: e01759e2fab0 ("tap: Explicitly drop IPv4 fragments, and give a warning") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Don't propagate host address expiry to the container	David Gibson	2023-08-16	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we copy addresses from the host to the container in nl_addr_dup(), we copy all the address's attributes, including IFA_CACHEINFO, which controls the address's lifetime. If the host address is managed by, for example, DHCP, it will typically have a finite lifetime. When we copy that lifetime to the pasta container, that lifetime will remain, meaning the kernel will eventually remove the address, typically some hours later. The container, however, won't have the DHCP client or whatever was managing and maintaining the address in the host, so it will just lose connectivity. Long term, we may want to monitor host address changes and reflect them to the guest. But for now, we just want to take a snapshot of the host's address and set those in the container permanently. We can accomplish that by stripping off the IFA_CACHEINFO attribute as we copy addresses. Link: https://github.com/containers/podman/issues/19405 Link: https://bugs.passt.top/show_bug.cgi?id=70 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Correctly calculate attribute length for address messages	David Gibson	2023-08-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	In nl_addr_get() and nl_addr_dup() we step the attributes attached to each RTM_NEWADDR message with a loop initialised with IFA_RTA() and RTM_PAYLOAD() macros. RTM_PAYLOAD(), however is for RTM_NEWROUTE messages (struct rtmsg), not RTM_NEWADDR messages (struct ifaddrmsg). Consequently it miscalculates the size and means we can skip some attributes. Switch to IFA_PAYLOAD() which we should be using here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Remove redundant check on nlmsg_type	David Gibson	2023-08-16	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \|	In the loop within nl_addr_dup() we check and skip for any messages that aren't of type RTM_NEWADDR. This is a leftover that was missed in the recent big netlink cleanup. In fact we already check for the message type in the nl_foreach_oftype() macro, so the explicit test is redudant. Remove it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Demote overlapping port ranges error to a warning	David Gibson	2023-08-13	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We give a fatal error if the port ranges from any port forwarding specifiers overlap. This occurs even if those port ranges are specifically bound to different addresses, so there's not really any overlap. Right now, we can't 100% handle this case correctly, because our data structures don't have a way to represent per-address forwarding. However, there are a number of cases that will actually work just fine: e.g. mapping the same port to the same port on two different addresses (say :: and 127.0.0.1). We have long term plans to fix this properly, but that is still some time away. For the time being, demote this error to a warning so that the cases that already work will be allowed. Link: https://bugs.passt.top/show_bug.cgi?id=56 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Use different epoll types for passt and pasta tap fds	David Gibson	2023-08-13	4	-30/+30
\| \| \| \| \| \| \| \| \| \| \|	Currently we have a single epoll event type for the "tap" fd, which could be either a handle on a /dev/net/tun device (pasta) or a connected Unix socket (passt). However for the two modes we call different handler functions. Simplify this a little by using different epoll types and dispatching directly to the correct handler function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split listening Unix domain socket into its own type	David Gibson	2023-08-13	4	-16/+15
\| \| \| \| \| \| \| \| \| \| \| \| \|	tap_handler() actually handles events on three different types of object: the /dev/tap character device (pasta), a connected Unix domain socket (passt) or a listening Unix domain socket (passt). The last, in particular, really has no handling in common with the others, so split it into its own epoll type and directly dispatch to the relevant handler from the top level. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of listening TCP sockets into their own handler	David Gibson	2023-08-13	7	-45/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of TCP timerfds into its own handler function	David Gibson	2023-08-13	4	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_handler() actually handles several different types of fd events. This includes timerfds that aren't sockets at all. The handling of these has essentially nothing in common with the other cases. So, give the TCP timers there own epoll_type value and dispatch directly to their handler. This also means we can remove the timer field from tcp_epoll_ref, the information it encoded is now implicit in the epoll_type value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Tiny cleanup to udp_sock_handler()	David Gibson	2023-08-13	2	-3/+2
\| \| \| \| \| \| \| \| \|	Move the test for c->no_udp into the function itself, rather than in the dispatching switch statement to better localize the UDP specific logic, and make for greated consistency with other handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of ICMP and ICMPv6 sockets	David Gibson	2023-08-13	3	-56/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have different epoll type values for ICMP and ICMPv6 sockets, but they both call the same handler function, icmp_sock_handler(). However that function does essentially nothing in common for the two cases. So, split it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them separately from the top level. While we're there remove some parameters that the function was never using anyway. Also move the test for c->no_icmp into the functions, so that all the logic specific to ICMP is within the handler, rather than in the top level dispatch code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Fold sock_handler into general switch on epoll event fd	David Gibson	2023-08-13	1	-27/+27
\| \| \| \| \| \| \| \| \| \| \| \|	When we process events from epoll_wait(), we check for a number of special cases before calling sock_handler() which then dispatches based on the protocol type of the socket in the event. Fold these cases together into a single switch on the fd type recorded in the epoll data field. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Always use epoll_ref for the epoll data variable	David Gibson	2023-08-13	4	-12/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Generalize epoll_ref to cover things other than sockets	David Gibson	2023-08-13	8	-56/+86
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Fold reset handling into tap_handler_passt()	David Gibson	2023-08-13	1	-36/+32
\| \| \| \| \| \| \| \| \| \| \|	We call tap_sock_reset() if tap_handler_passt() fails, or if we get an error event on the socket. Fold that logic into tap_handler() passt itself which simplifies the caller. It also makes it clearer that we had a redundant EPOLL_CTL_DEL and close() in one of the reset paths, so fix that too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Fold reset handling into tap_handler_pasta()	David Gibson	2023-08-13	1	-15/+15
\| \| \| \| \| \| \| \| \|	If tap_handler_pasta() fails, we reset the connection. But in the case of pasta the "reset" is just a fatal error. Fold the die() calls directly into tap_handler_pasta() for simplicity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Clean up behaviour for errors on listening Unix socket	David Gibson	2023-08-13	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We call tap_sock_unix_new() to handle a new connection to the qemu socket if we get an EPOLLIN event on c->fd_tap_listen. If we get any other event on the fd, we'll fall through to the "tap reset" path. But that won't do anything relevant to the listening socket, it will just close the already connected socket. Furthermore, the only other event we're subscribed to for the listening socket is EPOLLRDHUP, which doesn't apply to a non connected socket. Remove EPOLLRDHUP from the subscribed events. We don't need to explicitly add EPOLLERR, because errors are always reported. There's no obvious case that would cause an error on a listening socket anyway, and it's not obvious how we'd recover, treat it as a fatal error if it ever does happen. Finally, fold all this handling into the tap_sock_unix_new() function, there's no real reason to split it between there and tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Clean up tap reset path	David Gibson	2023-08-13	1	-23/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tap_handler() if we get an error on the tap device or socket, we use tap_sock_init() to re-initialise it. However, what we actually need for this reset case has remarkably little in common with the case where we're initialising for the first time: * Re-initialising the packet pools is unnecessary * The case of a passed in fd (--fd) isn't relevant * We don't even call this for pasta mode * We will never re-call tap_sock_unix_init() because we never clear fd_tap_listen In fact the only thing we do in tap_sock_init() relevant to the reset case is to remove the fd from the epoll and close it... which isn't used in the first initialisation case. So make a new tap_sock_reset() function just for this case, and simplify tap_sock_init() slightly as being used only for the first time case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: fix seq->p.count limit	Laurent Vivier	2023-08-13	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	The number of items in pool_l4_t is defined to UIO_MAXIOV, not TAP_SEQS. TAP_SEQS is the number of the sequences. Fix the value used to compare seq->p.count with. Fixes: 37c228ada88b ("tap, tcp, udp, icmp: Cut down on some oversized buffers") Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: s/messages/sequences/ in commit message, extend initialisation of packets in pool to UIO_MAXIOV items] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dup" operations	David Gibson	2023-08-04	3	-26/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We now detect errors on netlink "set" operations while configuring the pasta namespace with --config-net. However in many cases rather than a simple "set" we use a more complex "dup" function to copy configuration from the host to the namespace. We're not yet properly detecting and reporting netlink errors for that case. Change the "dup" operations to propagate netlink errors to their caller, pasta_ns_conf() and report them there. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "dump" operations	David Gibson	2023-08-04	3	-22/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if we receive any netlink errors while discovering network configuration from the host, we'll just ignore it and carry on. This might lead to cryptic error messages later on, or even silent misconfiguration. We now have the mechanisms to detect errors from get/dump netlink operations. Propgate these errors up to the callers and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Always process all responses to a netlink request	David Gibson	2023-08-04	1	-24/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A single netlink request can result in multiple response datagrams. We process multiple response datagrams in some circumstances, but there are cases where we exit early and will leave remaining datagrams in the queue. These will be flushed in nl_send() before we send another request. This is confusing, and not what we need to reliably check for errors from netlink operations. So, instead, make sure we always process all the response datagrams whenever we send a request (excepting fatal errors). In most cases this is just a matter of avoiding early exits from nl_foreach loops. nl_route_dup() is a bit trickier, because we need to retain all the routes we're going to try to copy in a single buffer. Here we instead use a secondary buffer to flush any remaining datagrams, and report an error if there are any additional routes in those datagrams . Link: https://bugs.passt.top/show_bug.cgi?id=67 Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Propagate errors for "set" operations	David Gibson	2023-08-04	3	-24/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently if anything goes wrong while we're configuring the namespace network with --config-net, we'll just ignore it and carry on. This might lead to a silently unconfigured or misconfigured namespace environment. For simple "set" operations based on nl_do() we can now detect failures reported via netlink. Propagate those errors up to pasta_ns_conf() and report them usefully. Link: https://bugs.passt.top/show_bug.cgi?id=60 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting changes in pasta_ns_conf()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Add nl_foreach_oftype to filter response message types	David Gibson	2023-08-04	1	-15/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In most cases where processing response messages, we expect only one type of message (excepting NLMSG_DONE or NLMSG_ERROR), and so we need a test and continue to skip anything else. Add a helper macro to do this. This also fixes a bug in nl_get_ext_if() where we didn't have such a test and if we got a message other than RTM_NEWROUTE we would have parsed its contents as nonsense. Also add a warning message if we get such an unexpected message type, which could be useful for debugging if we ever hit it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Split nl_req() to allow processing multiple response datagrams	David Gibson	2023-08-04	1	-68/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently nl_req() sends the request, and receives a single response datagram which we then process. However, a single request can result in multiple response datagrams. That happens nearly all the time for DUMP requests, where the 'DONE' message usually comes in a second datagram after the NEW{LINK\|ADDR\|ROUTE} messages. It can also happen if there are just too many objects to dump in a single datagram. Allow our netlink code to process multiple response datagrams by splitting nl_req() into three different helpers: nl_send() just sends a request, without getting a response. nl_status() checks a single message to see if it indicates the end of the reponses for our request. nl_next() moves onto the next response message, whether it's in a datagram we already received or we need to recv() a new one. We also add a 'for'-style macro to use these to step through every response message to a request across multiple datagrams. While we're at it, be more thourough with checking that our sequence numbers are in sync. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Clearer reasoning about the netlink response buffer size	David Gibson	2023-08-04	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we set NLBUFSIZ large enough for 8192 netlink headers (128kiB in total), and reference netlink(7). However netlink(7) says nothing about reponse buffer sizes, and the documents which do reference 8192 bytes not 8192 headers. Update NLBUFSIZ to 64kiB with a more detailed rationale. Link: https://bugs.passt.top/show_bug.cgi?id=67 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>