passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Unify part of spliced and non-spliced conn_from_sock path	David Gibson	2022-11-25	3	-50/+77
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In tcp_sock_handler() we split off to handle spliced sockets before checking anything else. However the first steps of the "new connection" path for each case are the same: allocate a connection entry and accept() the connection. Remove this duplication by making tcp_conn_from_sock() handle both spliced and non-spliced cases, with help from more specific tcp_tap_conn_from_sock and tcp_splice_conn_from_sock functions for the later stages which differ. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Separate helpers to create ns listening sockets	David Gibson	2022-11-25	3	-49/+93
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_init*() can create either sockets listening on the host, or in the pasta network namespace (with @ns==1). There are, however, a number of differences in how these two cases work in practice though. "ns" sockets are only used in pasta mode, and they always lead to spliced connections only. The functions are also only ever called in "ns" mode with a NULL address and interface name, and it doesn't really make sense for them to be called any other way. Later changes will introduce further differences in behaviour between these two cases, so it makes more sense to use separate functions for creating the ns listening sockets than the regular external/host listening sockets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify the IN_EPOLL flag	David Gibson	2022-11-25	3	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \|	There is very little common between the tcp_tap_conn and tcp_splice_conn structures. However, both do have an IN_EPOLL flag which has the same meaning in each case, though it's stored in a different location. Simplify things slightly by moving this bit into the common header of the two structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Partially unify tcp_timer() and tcp_splice_timer()	David Gibson	2022-11-25	4	-42/+37
\| \| \| \| \| \| \| \| \| \|	These two functions scan all the non-splced and spliced connections respectively and perform timed updates on them. Avoid scanning the now unified table twice, by having tcp_timer scan it once calling the relevant per-connection function for each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()	David Gibson	2022-11-25	4	-33/+13
\| \| \| \| \| \| \| \| \| \| \|	These two functions each step through non-spliced and spliced connections respectively and clean up entries for closed connections. To avoid scanning the connection table twice, we merge these into a single function which scans the unified table and performs the appropriate sort of cleanup action on each one. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Unify spliced and non-spliced connection tables	David Gibson	2022-11-25	5	-71/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently spliced and non-spliced connections are stored in completely separate tables, so there are completely independent limits on the number of spliced and non-spliced connections. This is a bit counter-intuitive. More importantly, the fact that the tables are separate prevents us from unifying some other logic between the two cases. So, merge these two tables into one, using the 'c.spliced' common field to distinguish between them when necessary. For now we keep a common limit of 128k connections, whether they're spliced or non-spliced, which means we save memory overall. If necessary we could increase this to a 256k or higher total, which would cost memory but give some more flexibility. For now, the code paths which need to step through all extant connections are still separate for the two cases, just skipping over entries which aren't for them. We'll improve that in later patches. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Improved helpers to update connections after moving	David Gibson	2022-11-25	2	-10/+23
\| \| \| \| \| \| \| \| \| \|	When we compact the connection tables (both spliced and non-spliced) we need to move entries from one slot to another. That requires some updates in the entries themselves. Add helpers to make all the necessary updates for the spliced and non-spliced cases. This will simplify later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Add connection union type	David Gibson	2022-11-25	3	-0/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, the tables for spliced and non-spliced connections are entirely separate, with different types in different arrays. We want to unify them. As a first step, create a union type which can represent either a spliced or non-spliced connection. For them to be distinguishable, the individual types need to have a common header added, with a bit indicating which type this structure is. This comes at the cost of increasing the size of tcp_tap_conn to over one (64 byte) cacheline. This isn't ideal, but it makes things simpler for now and we'll re-optimize this later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move connection state structures into a shared header	David Gibson	2022-11-25	4	-225/+245
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently spliced and non-spliced connections use completely independent tracking structures. We want to unify these, so as a preliminary step move the definitions for both variants into a new tcp_conn.h header, shared by tcp.c and tcp_splice.c. This requires renaming some #defines with the same name but different meanings between the two cases. In the process we correct some places that are slightly out of sync between the comments and the code for various event bit names. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Helpers for converting from index to/from tcp_splice_conn	David Gibson	2022-11-25	1	-18/+25
\| \| \| \| \| \| \| \| \| \| \|	Like we already have for non-spliced connections, create a CONN_IDX() macro for looking up the index of spliced connection structures. Change the name of the array of spliced connections to be different from that for non-spliced connections (even though they're in different modules). This will make subsequent changes a bit safer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Better helpers for converting between connection pointer and index	David Gibson	2022-11-25	1	-38/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The macro CONN_OR_NULL() is used to look up connections by index with bounds checking. Replace it with an inline function, which means: - Better type checking - No danger of multiple evaluation of an @index with side effects Also add a helper to perform the reverse translation: from connection pointer to index. Introduce a macro for this which will make later cleanups easier and safer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove unused TCP_MAX_SOCKS constant	David Gibson	2022-11-25	1	-1/+0
\| \| \| \| \| \| \|	Presumably it meant something in the past, but it's no longer used. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: #include tcp_splice.h in tcp_splice.c	David Gibson	2022-11-25	2	-4/+1
\| \| \| \| \| \| \| \| \|	This obvious include was omitted, which means that declarations in the header weren't checked against definitions in the .c file. This shows up an old declaration for a function that is now static, and a duplicate Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	style: Minor corrections to function comments	David Gibson	2022-11-25	2	-6/+6
\| \| \| \| \| \| \|	Some style issues and a typo. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	clang-tidy: Suppress warning about assignments in if statements	David Gibson	2022-11-25	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	clang-tools 15.0.0 appears to have added a new warning that will always complain about assignments in if statements, which we use in a number of places in passt/pasta. Encountered on Fedora 37 with clang-tools-extra-15.0.0-3.fc37.x86_64. Suppress the new warning so that we can compile and test. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Add link to weekly development meeting	Stefano Brivio	2022-11-21	1	-0/+4
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Fix left-over and indentation for Podman example command	Stefano Brivio	2022-11-16	1	-5/+3
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: The upcoming version of Podman adds support for pasta2022_11_16.ace074c	Stefano Brivio	2022-11-16	1	-3/+7
\| \| \| \| \| \| \|	Don't update the demo yet: we could just wait until Podman 4.3.2 is out and packaged. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, pasta: Add do_clone() wrapper around __clone2() and clone()	Stefano Brivio	2022-11-16	3	-8/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Spotted in Debian's buildd logs: on ia64, clone(2) is not available: the glibc wrapper is named __clone2() and it takes, additionally, the size of the stack area passed by the caller. Add a do_clone() wrapper handling the different cases, and also taking care of pointing the child's stack in the middle of the allocated area: on PA-RISC (hppa), handled by clone(), the stack grows up, and on ia64 the stack grows down, but the register backing store grows up -- and I think it might be actually used here. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/lib/test: Clean up iperf3 JSON files before starting the server	Stefano Brivio	2022-11-16	1	-2/+2
\| \| \| \| \| \| \| \|	...instead of doing it after the test. Now that we have pre-built guest images, we might also have old JSON files from previous, interrupted test runs. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Revert recently added checks in tap_handler_passt()	Stefano Brivio	2022-11-16	1	-16/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 198f87835dc4 ("tap: Return -EIO from tap_handler_passt() on inconsistent packet stream") and commit 510dace86ccf ("tap: Keep stream consistent if qemu length descriptor spans two recv() calls"). I can hit occasional failures in perf/passt_tcp tests where we seem to be getting excess data at the end of a recv(), and for some reason I couldn't figure out yet, if we just ignore it, subsequent recv() calls from qemu return correct data. If we close the connection, qemu can't talk to us anymore, of course. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	arp, tap, util: Don't use perror() after seccomp filter is installed	Stefano Brivio	2022-11-16	3	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If stderr is closed, after we fork to background, glibc's implementation of perror() will try to re-open it by calling dup(), upon which the seccomp filter causes the process to terminate, because dup() is not included in the list of allowed syscalls. Replace perror() calls that might happen after isolation_postfork(). We could probably replace all of them, but early ones need a bit more attention as we have to check whether log.c functions work in early stages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Remove contrib/debian, Debian package development now happens on Salsa	Stefano Brivio	2022-11-16	6	-63/+0
\| \| \| \| \| \| \| \| \|	The development of the Debian package is now at: https://salsa.debian.org/sbrivio/passt Drop contrib/debian, it's finally obsolete. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	contrib/apparmor: Merge pasta and passt profiles, update rules	Stefano Brivio	2022-11-16	2	-88/+51
\| \| \| \| \| \| \| \| \| \| \| \| \|	AppArmor resolves executable links before profile attachment rules are evaluated, so, as long as pasta is installed as a link to passt, there's no way to differentiate the two cases. Merge the two profiles and leave a TODO note behind, explaining two possible ways forward. Update the rules so that passt and pasta are actually usable, once the profile is installed. Most required changes are related to isolation and sandboxing features. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	README: Add links to Debian package tracker	Stefano Brivio	2022-11-16	1	-7/+10
\| \| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Change HPPA into PARISC while building PASST_AUDIT_ARCH	Stefano Brivio	2022-11-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	The AUDIT_ARCH defines in seccomp.h corresponding to HPPA are AUDIT_ARCH_PARISC and AUDIT_ARCH_PARISC64. Build error spotted in Debian's buildd log on phantom.physik.fu-berlin.de. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: It's AUDIT_ARCH_MIPSEL64, not AUDIT_ARCH_MIPS64EL	Stefano Brivio	2022-11-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	On mips64el, gcc -dumpmachine correctly reports mips64el as architecture prefix, but for some reason seccomp.h defines AUDIT_ARCH_MIPSEL64 and not AUDIT_ARCH_MIPS64EL. Mangle AUDIT_ARCH accordingly. Build error spotted in Debian's buildd logs from Loongson build. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Don't filter out -O2 from supplied flags for AVX2 builds	Stefano Brivio	2022-11-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Drop it from the internal FLAGS variable, but honour -O2 if passed in CFLAGS. In Debian packages, dpkg-buildflags uses it as hardening flag, and we get a QA warning if we drop it: https://qa.debian.org/bls/bytag/W-dpkg-buildflags-missing.html Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Makefile: Honour passed CPPFLAGS, not just CFLAGS	Stefano Brivio	2022-11-16	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \|	CPPFLAGS allow the user to pass pre-processor flags. This is unlikely to be needed at the moment, but the Debian Hardening Walkthrough reasonably requests it to be handled in order to fully support hardened build flags: https://wiki.debian.org/HardeningWalkthrough#Handling_dpkg-buildflags_in_your_upstream_build_system Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, udp: Drop mostly duplicated dns_send arrays, rename related fields	Stefano Brivio	2022-11-16	6	-71/+62
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that we use just the first valid DNS resolver address configured, or read from resolv.conf(5) on the host, to forward DNS queries to, in case --dns-forward is used, we don't need to duplicate dns[] to dns_send[]: - rename dns_send[] back to dns[]: those are the resolvers we advertise to the guest/container - for forwarding purposes, instead of dns[], use a single field (for each protocol version): dns_host - and rename dns_fwd to dns_match, so that it's clear this is the address we are matching DNS queries against, to decide if they need to be forwarded Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Fix mask calculation from prefix_len in conf_print()2022_11_10.4129764	Stefano Brivio	2022-11-10	1	-1/+4
\| \| \| \| \| \| \|	Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: dd09cceaee21 ("Minor improvements to IPv4 netmask handling") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, udp: Don't initialise IPv6/IPv4 sockets if IPv4/IPv6 are not enabled	Stefano Brivio	2022-11-10	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we disable a given IP version automatically (no corresponding default route on host) or administratively (--ipv4-only or --ipv6-only options), we don't initialise related buffers and services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap" handlers will also ignore packets with a disabled IP version. However, in commit 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") I happily changed socket initialisation functions to take AF_UNSPEC meaning "any enabled IP version", but I forgot to add checks back for the "enabled" part. Reported by Paul: on a host without default IPv6 route, but IPv6 enabled, connect, using IPv6, to a port handled by pasta, which tries to send data to a tap device without initialised buffers for that IP version and exits because the resulting write() fails. Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or pasta -4 and inbound IPv6 connection. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt: Move __setlogmask() calls before output unrelated to configuration	Stefano Brivio	2022-11-10	1	-8/+8
\| \| \| \| \| \| \| \| \| \| \|	...so that we avoid printing some lines twice because log-level is still set to LOG_EMERG, as if logging configuration didn't happen yet. While at it, note that logging to stderr doesn't really depend on whether debug mode is enabled or not. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Return -EIO from tap_handler_passt() on inconsistent packet stream	Stefano Brivio	2022-11-10	1	-2/+2
\| \| \| \| \| \| \| \|	While it's important to fail in that case, it makes little sense to fail quietly: it's better to tell qemu explicitly that something went wrong and that we won't recover, by closing the socket. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Keep stream consistent if qemu length descriptor spans two recv() calls	Stefano Brivio	2022-11-10	1	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I got all paranoid after triggering a divide-by-zero general protection fault in passt with a qemu version without the virtio_net TX hang fix, while flooding UDP. I start thinking this was actually coming from some random changes I was playing with, but before reaching this conclusion I reviewed once more the relatively short path in tap_handler_passt() before we start using packet_*() functions, and found this. Never observed in practice, but artificially reproduced with changes in qemu's socket interface: if we don't receive from qemu a complete length descriptor in one recv() call, or if we receive a partial one at the end of one call, we currently disregard the rest, which would make the stream inconsistent. Nothing really bad happens, except that from that point on we would disregard all the packets we get until, if ever, we get the stream back in sync by chance. Force reading a complete packet length descriptor with a blocking recv(), if needed -- not just a complete packet later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/memory/passt: Change passt.avx2 path to /bin in test itself2022_11_04.e308018	Stefano Brivio	2022-11-04	1	-2/+2
\| \| \| \| \| \| \| \|	Now that we install the binary in /bin, and we have a link from /usr/bin, change the path in the test itself as well. Otherwise it works with bash but not with dash for some reason. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, qrap, README: Update notes and documentation for AF_UNIX support in qemu	Stefano Brivio	2022-11-04	5	-28/+18
\| \| \| \| \| \| \| \|	We can't get rid of qrap quite yet, but at least we should start telling users it's not going to be needed anymore starting from qemu 7.2. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Finally drop workaround for virtio_net TX stall	Stefano Brivio	2022-11-04	2	-30/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Now that we require 13c6be96618c ("net: stream: add unix socket") in qemu to run the tests, we can also assume that commit df8d07081718 ("virtio-net: fix bottom-half packet TX on asynchronous completion") is present, as it was merged before that one. This fixes the issue we attempted to work around in passt TCP and UDP performance tests: finally drop that stuff. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Switch to qemu -netdev stream option instead of using qrap	Stefano Brivio	2022-11-04	7	-77/+75
\| \| \| \| \| \| \| \| \| \| \|	qemu commit 13c6be96618c ("net: stream: add unix socket") introduces support for native AF_UNIX support, finally making qrap useless. We can't quite drop that yet until a qemu release includes it, and then we'll need to wait a while for users to switch anyway, but at least for tests, we can use that support. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Wait for network before starting passt in two_guests setup	Stefano Brivio	2022-11-04	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	As pasta now configures that target network namespace with --config-net, we need to wait for addresses and routes to be actually present. Just sending netlink messages doesn't mean this is done synchronously. A more elegant alternative, which probably makes sense regardless of this test setup, would be to query, from pasta, addresses and routes we added, and wait until they're there, before proceeding. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Check for answers to forwarded DNS queries before handling local redirects	Stefano Brivio	2022-11-04	1	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \|	Now that we allow loopback DNS addresses to be used as targets for forwarding, we need to check if DNS answers come from those targets, before deciding to eventually remap traffic for local redirects. Otherwise, the source address won't match the one configured as forwarder, which means that the guest or the container will refuse those responses. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Split the notions of read DNS addresses and offered ones	Stefano Brivio	2022-11-04	5	-23/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With --dns-forward, if the host has a loopback address configured as DNS server, we should actually use it to forward queries, but, if --no-map-gw is passed, we shouldn't offer the same address via DHCP, NDP and DHCPv6, because it's not going to be reachable. Problematic configuration: * systemd-resolved configuring the usual 127.0.0.53 on the host: we read that from /etc/resolv.conf * --dns-forward specified with an unrelated address, for example 198.51.100.1 We still want to forward queries to 127.0.0.53, if we receive one directed to 198.51.100.1, so we can't drop 127.0.0.53 from our list: we want to use it for forwarding. At the same time, we shouldn't offer 127.0.0.53 to the guest or container either. With this change, I'm only covering the case of automatically configured DNS servers from /etc/resolv.conf. We could extend this to addresses configured with command-line options, but I don't really see a likely use case at this point. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway	Stefano Brivio	2022-11-04	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seen in a Google Compute Engine environment with a machine configured via cloud-init-dhcp, while testing Podman integration for pasta: the assigned address has a /32 netmask, and there's a default route, which can be added on the host because there's another route, also /32, pointing to the default gateway. For example, on the host: ip -4 address add 10.156.0.2/32 dev eth0 ip -4 route add 10.156.0.1/32 dev eth0 ip -4 route add default via 10.156.0.1 This is not a valid configuration as far as I can tell: if the address is configured as /32, it shouldn't be used to reach a gateway outside its derived netmask. However, Linux allows that, and everything works. The problem comes when pasta --config-net sources address and default route from the host, and it can't configure the route in the target namespace because the gateway is invalid. That is, we would skip configuring the first route in the example, which results in the equivalent of doing: ip -4 address add 10.156.0.2/32 dev eth0 ip -4 route add default via 10.156.0.1 where, at this point, 10.156.0.1 is unreachable, and hence invalid as a gateway. Sourcing more routes than just the default is doable, but probably undesirable: pasta users want to provide connectivity to a container, not reflect exactly whatever trickery is configured on the host. Add a consistency check and an adjustment: if the configured default gateway is not reachable, shrink the given netmask until we can reach it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Correct function comments for address types	David Gibson	2022-11-04	1	-6/+6
\| \| \| \| \| \| \| \| \|	A number of functions describe themselves as taking a pointer to 'sin_addr or sin6_addr'. Those are field names, not type names. Replace them with the correct type names, in_addr or in6_addr. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use endian-safer typing in struct tap4_l4_t	David Gibson	2022-11-04	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \|	We recently converted to using struct in_addr rather than bare in_addr_t or uint32_t to represent IPv4 addresses in network order. This makes it harder forget to apply the correct endian conversions. We omitted the IPv4 addresses stored in struct tap4_l4_t, however. Convert those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use typing to reduce chances of IPv4 endianness errors	David Gibson	2022-11-04	14	-100/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use IPV4_IS_LOOPBACK more widely	David Gibson	2022-11-04	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	This macro checks if an IPv4 address is in the loopback network (127.0.0.0/8). There are two places where we open code an identical check, use the macro instead. There are also a number of places we specifically exclude the loopback address (127.0.0.1), but we should actually be excluding anything in the loopback network. Change those sites to use the macro as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Minor improvements to IPv4 netmask handling	David Gibson	2022-11-04	4	-33/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Correct some missing endian conversions of IPv4 addresses	David Gibson	2022-11-04	2	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The INADDR_LOOPBACK constant is in host endianness, and similarly the IN_MULTICAST macro expects a host endian address. However, there are some places in passt where we use those with network endian values. This means that passt will incorrectly allow you to set 127.0.0.1 or a multicast address as the guest address or DNS forwarding address. Add the necessary conversions to correct this. INADDR_ANY and INADDR_BROADCAST logically behave the same way, although because they're palindromes it doesn't have an effect in practice. Change them to be logically correct while we're there, though. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Add memory/passt test cases	Stefano Brivio	2022-11-04	6	-1/+289
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These show a summary of memory usage in kernel and userspace with different port forwarding configurations, details of userspace usage using 'nm' (passt only uses statically allocated memory), and details of kernel memory from slab reporting facilities. This adds a new test image, mbuto.mem.img, with harcoded IPv4 and IPv6 addresses and routes, and just the tools we need to start and stop passt, to report from /proc/slabinfo, /proc/meminfo, and to print and parse symbol sizes using nm(1). passt can't pivot_root() for sandboxing purposes on ramfs, so we need to create another filesystem and chroot into it, first. We don't want to use pane context functions, as we're checking memory usage for sockets: resort to screen-scraping. Configure a dummy interface to provide passt with an appearance of working IPv4 and IPv6 connectivity, contributed by David. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>