passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	conf: Split the notions of read DNS addresses and offered ones	Stefano Brivio	2022-11-04	5	-23/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With --dns-forward, if the host has a loopback address configured as DNS server, we should actually use it to forward queries, but, if --no-map-gw is passed, we shouldn't offer the same address via DHCP, NDP and DHCPv6, because it's not going to be reachable. Problematic configuration: * systemd-resolved configuring the usual 127.0.0.53 on the host: we read that from /etc/resolv.conf * --dns-forward specified with an unrelated address, for example 198.51.100.1 We still want to forward queries to 127.0.0.53, if we receive one directed to 198.51.100.1, so we can't drop 127.0.0.53 from our list: we want to use it for forwarding. At the same time, we shouldn't offer 127.0.0.53 to the guest or container either. With this change, I'm only covering the case of automatically configured DNS servers from /etc/resolv.conf. We could extend this to addresses configured with command-line options, but I don't really see a likely use case at this point. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Adjust netmask on mismatch between IPv4 address/netmask and gateway	Stefano Brivio	2022-11-04	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seen in a Google Compute Engine environment with a machine configured via cloud-init-dhcp, while testing Podman integration for pasta: the assigned address has a /32 netmask, and there's a default route, which can be added on the host because there's another route, also /32, pointing to the default gateway. For example, on the host: ip -4 address add 10.156.0.2/32 dev eth0 ip -4 route add 10.156.0.1/32 dev eth0 ip -4 route add default via 10.156.0.1 This is not a valid configuration as far as I can tell: if the address is configured as /32, it shouldn't be used to reach a gateway outside its derived netmask. However, Linux allows that, and everything works. The problem comes when pasta --config-net sources address and default route from the host, and it can't configure the route in the target namespace because the gateway is invalid. That is, we would skip configuring the first route in the example, which results in the equivalent of doing: ip -4 address add 10.156.0.2/32 dev eth0 ip -4 route add default via 10.156.0.1 where, at this point, 10.156.0.1 is unreachable, and hence invalid as a gateway. Sourcing more routes than just the default is doable, but probably undesirable: pasta users want to provide connectivity to a container, not reflect exactly whatever trickery is configured on the host. Add a consistency check and an adjustment: if the configured default gateway is not reachable, shrink the given netmask until we can reach it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Correct function comments for address types	David Gibson	2022-11-04	1	-6/+6
\| \| \| \| \| \| \| \| \|	A number of functions describe themselves as taking a pointer to 'sin_addr or sin6_addr'. Those are field names, not type names. Replace them with the correct type names, in_addr or in6_addr. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use endian-safer typing in struct tap4_l4_t	David Gibson	2022-11-04	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \|	We recently converted to using struct in_addr rather than bare in_addr_t or uint32_t to represent IPv4 addresses in network order. This makes it harder forget to apply the correct endian conversions. We omitted the IPv4 addresses stored in struct tap4_l4_t, however. Convert those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use typing to reduce chances of IPv4 endianness errors	David Gibson	2022-11-04	14	-100/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use IPV4_IS_LOOPBACK more widely	David Gibson	2022-11-04	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	This macro checks if an IPv4 address is in the loopback network (127.0.0.0/8). There are two places where we open code an identical check, use the macro instead. There are also a number of places we specifically exclude the loopback address (127.0.0.1), but we should actually be excluding anything in the loopback network. Change those sites to use the macro as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Minor improvements to IPv4 netmask handling	David Gibson	2022-11-04	4	-33/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several minor problems with our parsing of IPv4 netmasks (-n). First, we don't reject nonsensical netmasks like 0.255.0.255. Address this structurally by using prefix length instead of netmask as the primary variable, only converting (and validating) when we need to. This has the added benefit of making some things more uniform with the IPv6 path. Second, when the user specifies a prefix length, we truncate the output from strtol() to an integer, which means we would treat -n 4294967320 as valid (equivalent to 24). Fix types to check for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Correct some missing endian conversions of IPv4 addresses	David Gibson	2022-11-04	2	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The INADDR_LOOPBACK constant is in host endianness, and similarly the IN_MULTICAST macro expects a host endian address. However, there are some places in passt where we use those with network endian values. This means that passt will incorrectly allow you to set 127.0.0.1 or a multicast address as the guest address or DNS forwarding address. Add the necessary conversions to correct this. INADDR_ANY and INADDR_BROADCAST logically behave the same way, although because they're palindromes it doesn't have an effect in practice. Change them to be logically correct while we're there, though. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Add memory/passt test cases	Stefano Brivio	2022-11-04	6	-1/+289
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These show a summary of memory usage in kernel and userspace with different port forwarding configurations, details of userspace usage using 'nm' (passt only uses statically allocated memory), and details of kernel memory from slab reporting facilities. This adds a new test image, mbuto.mem.img, with harcoded IPv4 and IPv6 addresses and routes, and just the tools we need to start and stop passt, to report from /proc/slabinfo, /proc/meminfo, and to print and parse symbol sizes using nm(1). passt can't pivot_root() for sandboxing purposes on ramfs, so we need to create another filesystem and chroot into it, first. We don't want to use pane context functions, as we're checking memory usage for sockets: resort to screen-scraping. Configure a dummy interface to provide passt with an appearance of working IPv4 and IPv6 connectivity, contributed by David. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib: Add "td" directive, handled by table_value()	Stefano Brivio	2022-11-04	2	-0/+31
\| \| \| \| \| \| \|	This can be used for generic cell values with an arbitrary scale. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib/perf_report: Use own flag to track initialisation	Stefano Brivio	2022-11-04	1	-3/+6
\| \| \| \| \| \| \| \| \|	Instead of just disabling performance reports if running in demo mode. This allows us to use table functions outside of performance reports. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Support for detection of existing sockets on ramfs	Stefano Brivio	2022-11-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On ramfs, connecting to a non-existent UNIX domain socket yields EACCESS, instead of ENOENT. This is visible if we use passt directly on rootfs (a ramfs instance) from an initramfs image. It's probably wrong for ramfs to return EACCES, but given the simplicity of the filesystem, I doubt we should try to fix it there at the possible cost of added complexity. Also, this whole beauty should go away once qrap-less usage is established, so just accept EACCES as indication that a conflicting socket does not, in fact, exist. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/lib: Move screen-scraping setup and layout functions to _ugly files	Stefano Brivio	2022-11-04	5	-92/+123
\| \| \| \| \| \| \| \| \| \| \|	I'm going to add yet another one of those, for which I have no quick solution. It's a regression in some sense, but at least if we make this regression more observable and defined, it should be easier to find a comprehensive solution later, within this or another testing framework. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	README: Add Podman, vhost-user links, and links to Bugzilla queries	Stefano Brivio	2022-10-27	1	-4/+8
\| \| \| \| \| \| \| \| \| \|	Unfortunately Bugzilla doesn't enable sharing of queries to unregistered users: https://bugzilla.mozilla.org/show_bug.cgi?id=400063 ...but we can still use ugly search links. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.1: Fix typo: "addressses", reported by Lintian	Stefano Brivio	2022-10-27	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Don't discard first reply sequence for a given echo ID2022_10_26.f212044	Stefano Brivio	2022-10-27	3	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, ICMP and ICMPv6 echo sockets relay back to us any reply we send: we're on the same host as the target, after all. We discard them by comparing the last sequence we sent with the sequence we receive. However, on the first reply for a given identifier, the sequence might be zero, depending on the implementation of ping(8): we need another value to indicate we haven't sent any sequence number, yet. Use -1 as initialiser in the echo identifier map. This is visible with Busybox's ping, and was reported by Paul on the integration at https://github.com/containers/podman/pull/16141, with: $ podman run --net=pasta alpine ping -c 2 192.168.188.1 ...where only the second reply would be routed back. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 33482d5bf293 ("passt: Add PASTA mode, major rework") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Add debugging messages for handled replies and requests	Stefano Brivio	2022-10-27	1	-5/+25
\| \| \| \| \| \| \|	...instead of just reporting errors. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Trace received (outbound) ICMP packets in debug mode, too	Stefano Brivio	2022-10-27	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	This only worked for ICMPv6: ICMP packets have no TCP-style header, so they are handled as a special case before packet sequences are formed, and the call to tap_packet_debug() was missing. Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt.1: Don't imply --foreground with --debug	Stefano Brivio	2022-10-27	2	-7/+5
\| \| \| \| \| \| \| \| \| \|	Having -f implied by -d (and --trace) usually saves some typing, but debug mode in background (with a log file) is quite useful if pasta is started by Podman, and is probably going to be handy for passt with libvirt later, too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/run: Temporarily disable distribution tests2022_10_26.e4df8b0	Stefano Brivio	2022-10-26	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \|	They're too slow to cope with current release cycles, and they haven't found bugs in months, also because clang-tidy and cppcheck would find most of them earlier. Disable them for the moment. We should pre-install gcc and make in non-x86 images, as those run on my test machine with qemu TCG, and that's the real slow-down here. Then we can re-enable them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	hooks: Temporarily disable demo generation in pre-push	Stefano Brivio	2022-10-26	1	-4/+11
\| \| \| \| \| \| \| \| \| \| \|	The out-of-tree Podman patch needs to be rebased every second week or so, and I'm currently trying to get that upstream: https://github.com/containers/podman/pull/16141 Disable demo generation for the moment, so that I avoid wasting time with those rebases. We'll re-enable it later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Add log file tests for pasta plus corresponding layout and setup	Stefano Brivio	2022-10-26	5	-1/+150
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To test log files on a tmpfs mount, we need to unshare the mount namespace, which means using a context for the passt pane is not really practical at the moment, as we can't open a shell there, so we would have to encapsulate all the commands under 'unshare -rUm', plus the "inner" pasta command, running in turn a tcp_rr server. It might be worth fixing this by e.g. detecting we are trying to spawn an interactive shell and adding a special path in the context setup with some form of stdin redirection -- I'm not sure it's doable though. For this reason, add a new layout, using a context only for the host pane, while keeping the old command dispatch mechanism for the passt pane. We also need a new setup function that doesn't start pasta: we want to start and restart it with different options. Further, we need a 'pint' directive, to send an interrupt to the passt pane: add that in lib/test. All the tests before the one involving tmpfs and a detached mount namespace were also tested with the context mechanism. To make an eventual conversion easier, pass tcp_crr directly as a command on pasta's command line where feasible. While at it, fix the comment to the teardown_pasta() function. The new test set can be semi-conveniently run as: ./run pasta_options/log_to_file and it checks basic log creation, size of the log file after flooding it with debug entries, rotations, and basic consistency after rotations, on both an existing filesystem and a tmpfs, chosen as it doesn't support collapsing data ranges via fallocate(), hence triggering the fall-back mechanism for logging rotation. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Fix calculation for ICMP checksum on IPv4	Stefano Brivio	2022-10-26	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \|	We need to zero out the checksum field before calculating the checksum, of course. I have no idea how this passed the "icmp" test set, looking into it. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 67ab6171729c ("Add csum_icmp4() helper for calculating ICMP checksums") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Don't pass leading ~ to parse_port_range() on exclusions2022_10_24.c11277b	Stefano Brivio	2022-10-24	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 84fec4e998b6 ("Clean up parsing of port ranges") drops the strspn() call before the parsing of excluded port ranges, because now we're checking against any stray characters at every step. However, that also has the effect of passing ~ as first character to the new parse_port_range(), which makes no sense: we already checked that ~ is the first character before the call, so skip it. Alona reported this output: Invalid port specifier ~15000,~15001,~15006,~15008,~15020,~15021,~15090 while the whole specifier is indeed valid. Reported-by: Alona Paz <alkaplan@redhat.com> Fixes: 84fec4e998b6 ("Clean up parsing of port ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size2022_10_22.b68da10	Stefano Brivio	2022-10-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...instead of one fourth. On the main() -> conf() -> nl_sock_init() call path, LTO from gcc 12 on (at least) x86_64 decides to inline... everything: nl_sock_init() is effectively part of main(), after commit 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()"). This means we exceed the maximum stack size, and we get SIGSEGV, under any condition, at start time, as reported by Andrea on a recent build for CentOS Stream 9. The calculation of NS_FN_STACK_SIZE, which is the stack size we reserve for clones, was previously obtained by dividing the maximum stack size by two, to avoid an explicit check on architecture (on PA-RISC, also known as hppa, the stack grows up, so we point the clone to the middle of this area), and then further divided by two to allow for any additional usage in the caller. Well, if there are essentially no function calls anymore, this is not enough. Divide it by eight, which is anyway much more than possibly needed by any clone()d callee. I think this is robust, so it's a fix in some sense. Strictly speaking, though, we have no formal guarantees that this isn't either too little or too much. What we should do, eventually: check cloned() callees, there are just thirteen of them at the moment. Note down any stack usage (they are mostly small helpers), bonus points for an automated way at build time, quadruple that or so, to allow for extreme clumsiness, and use as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. Reported-by: Andrea Bolognani <abologna@redhat.com> Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add git-publish configuration file	Andrea Bolognani	2022-10-22	1	-0/+3
\| \| \| \| \| \|	Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	qrap: Support JSON syntax for -device	Andrea Bolognani	2022-10-21	1	-10/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Starting with version 8.1.0, libvirt uses JSON syntax when generating the arguments to -device, so they will now look like {"driver":"virtio-scsi-pci","bus":"pci.3","addr":"0x0"} instead of virtio-scsi-pci,bus=pci.3,addr=0x0 qrap needs to parse these arguments and extract the bus number in order to figure out what address to use for the virtio-net device it adds, and the libvirt change described above has broken this parsing logic. Tweak the code so that both styles are accepted and handled correctly. Note that, when JSON is in use, qrap needs to generate its own command line options in that format as well or things will not work as expected. Signed-off-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	dhcp: Use tap_udp4_send() helper in dhcp()	David Gibson	2022-10-19	2	-17/+2
\| \| \| \| \| \| \| \| \| \|	The IPv4 specific dhcp() manually constructs L2 and IP headers to send its DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the tap_udp6_send() helper. Now that we've broaded the parameters to tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap_ip4_send() into UDP and ICMP variants	David Gibson	2022-10-19	3	-21/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_ip4_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp4_send() and tap_icmp4_send() functions without greatly increasing the code size, this removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv4 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Use tap_icmp6_send() helper	David Gibson	2022-10-19	1	-17/+4
\| \| \| \| \| \| \| \| \| \| \|	We send ICMPv6 packets to the guest from both icmp.c and from ndp.c. The case in ndp() manually constructs L2 and IPv6 headers, unlike the version in icmp.c which uses the tap_icmp6_send() helper from tap.c Now that we've broaded the parameters of tap_icmp6_send() we can use it in ndp() as well saving some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Remove unneeded eh_source parameter	David Gibson	2022-10-19	3	-7/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ndp() takes a parameter giving the ethernet source address of the packet it is to respond to, which it uses to determine the destination address to send the reply packet to. This is not necessary, because the address will always be the guest's MAC address. Even if the guest has just changed MAC address, then either tap_handler_passt() or tap_handler_pasta() - which are the only call paths leading to ndp() will have updated c->mac_guest with the new value. So, remove the parameter, and just use c->mac_guest, making it more consistent with other paths where we construct packets to send inwards. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap_ip6_send() into UDP and ICMP variants	David Gibson	2022-10-19	4	-40/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tap_ip6_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp6_send() and tap_icmp6_send() functions without greatly increasing the code size, this removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv6 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Split tap_ip_send() into IPv4 and IPv6 specific functions	David Gibson	2022-10-19	4	-96/+103
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and it turns out that every caller (statically) knows if it is using IPv4 or IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions. Use a new tap_l2_hdr() function for the very small common part. While we're there, make some minor cleanups: - We were double writing some fields in the IPv6 header, so that it temporary matched the pseudo-header for checksum calculation. With recent checksum reworks, this isn't neccessary any more. - We don't use any IPv4 header options, so use some sizeof() constructs instead of some open coded values for header length. - The comment used to say that the flow label was for TCP over IPv6, but in fact the only thing we used it for was DHCPv6 over UDP traffic Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Remove unhelpeful vnet_pre optimization from tap_send()	David Gibson	2022-10-19	5	-24/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Callers of tap_send() can optionally use a small optimization by adding extra space for the 4 byte length header used on the qemu socket interface. tap_ip_send() is currently the only user of this, but this is used only for "slow path" ICMP and DHCP packets, so there's not a lot of value to the optimization. Worse, having the two paths here complicates the interface and makes future cleanups difficult, so just remove it. I have some plans to bring back the optimization in a more general way in future, but for now it's just in the way. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove support for TCP packets from tap_ip_send()	David Gibson	2022-10-19	3	-44/+2
\| \| \| \| \| \| \| \| \| \| \|	tap_ip_send() is never used for TCP packets, we're unlikely to use it for that in future, and the handling of TCP packets makes other cleanups unnecessarily awkward. Remove it. This is the only user of csum_tcp4(), so we can remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add helpers for normal inbound packet destination addresses	David Gibson	2022-10-19	2	-5/+31
\| \| \| \| \| \| \| \| \| \|	tap_ip_send() doesn't take a destination address, because it's specifically for inbound packets, and the IP addresses of the guest/namespace are already known to us. Rather than open-coding this destination address logic, make helper functions for it which will enable some later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_ip4_header() helper to calculate IPv4 header checksums	David Gibson	2022-10-19	4	-4/+13
\| \| \| \| \| \| \| \|	We calculate IPv4 header checksums in at least two places, in dhcp() and in tap_ip_send. Add a helper to handle this calculation in both places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_udp4() helper for calculating UDP over IPv4 checksums	David Gibson	2022-10-19	4	-2/+37
\| \| \| \| \| \| \| \| \| \| \| \| \|	At least two places in passt fill in UDP over IPv4 checksums, although since UDP checksums are optional with IPv4 that just amounts to storing a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in dhcp()). For consistency, add a helper for this "calculation". Just for the heck of it, add the option (compile time disabled for now) to calculate real UDP checksums. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_udp6() helper for calculating UDP over IPv6 checksums	David Gibson	2022-10-19	3	-3/+28
\| \| \| \| \| \| \| \| \| \| \|	Add a helper for calculating UDP checksums when used over IPv6 For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the UDP header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_icmp4() helper for calculating ICMP checksums	David Gibson	2022-10-19	3	-3/+19
\| \| \| \| \| \| \| \| \| \|	Although tap_ip_send() is currently the only place calculating ICMP checksums, create a helper function for symmetry with ICMPv6. For future flexibility it allows the ICMPv6 header and payload to be in separate buffers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_icmp6() helper for calculating ICMPv6 checksums	David Gibson	2022-10-19	4	-8/+33
\| \| \| \| \| \| \| \| \| \| \| \|	At least two places in passt calculate ICMPv6 checksums, ndp() and tap_ip_send(). Add a helper to handle this calculation in both places. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the ICMPv6 header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.1: Add David to AUTHORS2022_10_15.b3f3591	Stefano Brivio	2022-10-15	1	-2/+2
\| \| \| \| \| \| \|	I just realised while reading the man page. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()	Stefano Brivio	2022-10-15	5	-63/+98
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in the target user namespace as we isolate the process, which means we're unable to bind to low ports at that point. Bind inbound ports, and only those, before isolate_user(). Keep the handling of outbound ports (for pasta mode only) after the setup of the namespace, because that's where we'll bind them. To this end, initialise the netlink socket for the init namespace before isolate_user() as well, as we actually need to know the addresses of the upstream interface before binding ports, in case they're not explicitly passed by the user. As we now call nl_sock_init() twice, checking its return code from conf() twice looks a bit heavy: make it exit(), instead, as we can't do much if we don't have netlink sockets. While at it: - move the v4_only && v6_only options check just after the first option processing loop, as this is more strictly related to option parsing proper - update the man page, explaining that CAP_NET_BIND_SERVICE is not the preferred way to bind ports, because passt and pasta can be abused to allow other processes to make effective usage of it. Add a note about the recommended sysctl instead - simplify nl_sock_init_do() now that it's called once for each case Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Rename pasta_setup_ns() to pasta_spawn_cmd()	David Gibson	2022-10-15	1	-9/+9
\| \| \| \| \| \| \| \| \| \|	pasta_setup_ns() no longer has much to do with setting up a namespace. Instead it's really about starting the shell or other command we want to run with pasta connectivity. Rename it and its argument structure to be less misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Only configure UID/GID mappings in userns when spawning shell	David Gibson	2022-10-15	4	-16/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When in passt mode, or pasta mode spawning a command, we create a userns for ourselves. This is used both to isolate the pasta/passt process itself and to run the spawned command, if any. Since eed17a47 "Handle userns isolation and dropping root at the same time" we've handled both cases the same, configuring the UID and GID mappings in the new userns to map whichever UID we're running as to root within the userns. This mapping is desirable when spawning a shell or other command, so that the user gets a root shell with reasonably clear abilities within the userns and netns. It's not necessarily essential, though. When not spawning a shell, it doesn't really have any purpose: passt itself doesn't need to be root and can operate fine with an unmapped user (using some of the capabilities we get when entering the userns instead). Configuring the uid_map can cause problems if passt is running with any capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to allow it to forward low ports. In this case the kernel makes files in /proc/pid owned by root rather than the starting user to prevent the user from interfering with the operation of the capability-enhanced process. This includes uid_map meaning we are not able to write to it. Whether this behaviour is correct in the kernel is debatable, but in any case we might as well avoid problems by only initializing the user mappings when we really want them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Prevent any child processes gaining capabilities	David Gibson	2022-10-15	1	-0/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We drop our own capabilities, but it's possible that processes we exec() could gain extra privilege via file capabilities. It shouldn't be possible for us to exec() anyway due to seccomp() and our filesystem isolation. But just in case, zero the bounding and inheritable capability sets to prevent any such child from gainin privilege. Note that we do this after spawning the pasta shell/command (if any), because we do want the user to be able to give that privilege if they want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Replace drop_caps() with a version that actually does something	David Gibson	2022-10-15	3	-11/+92
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current implementation of drop_caps() doesn't really work because it attempts to drop capabilities from the bounding set. That's not the set that really matters, it's about limiting the abilities of things we might later exec() rather than our own capabilities. It also requires CAP_SETPCAP which we won't usually have. Replace it with a new version which uses setcap(2) to drop capabilities from the effective and permitted sets. For now we leave the inheritable set as is, since we don't want to preclude the user from passing inheritable capabilities to the command spawed by pasta. Correctly dropping caps reveals that we were relying on some capabilities we'd supposedly dropped. Re-divide the dropping of capabilities between isolate_initial(), isolate_user() and isolate_prefork() to make this work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Refactor isolate_user() to allow for a common exit path	David Gibson	2022-10-15	1	-24/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, isolate_user() exits early if the --netns-only option is given. That works for now, but shortly we're going to want to add some logic to go at the end of isolate_user() that needs to run in all cases: joining a given userns, creating a new userns, or staying in our original userns (--netns-only). To avoid muddying those changes, here we reorganize isolate_user() to have a common exit path for all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Replace FWRITE with a function	David Gibson	2022-10-15	4	-22/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a few places we use the FWRITE() macro to open a file, replace it's contents with a given string and close it again. There's no real reason this needs to be a macro rather than just a function though. Turn it into a function 'write_file()' and make some ancillary cleanups while we're there: - Add a return code so the caller can handle giving a useful error message - Handle the case of short write()s (unlikely, but possible) - Add O_TRUNC, to make sure we replace the existing contents entirely Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Clarify various self-isolation steps	David Gibson	2022-10-15	3	-13/+86
\| \| \| \| \| \| \| \| \| \|	We have a number of steps of self-isolation scattered across our code. Improve function names and add comments to make it clearer what the self isolation model is, what the steps do, and why they happen at the points they happen. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>