passt - Plug A Simple Socket Transport

	Commit message	Author	Age	Files	Lines
*	icmp: Add debugging messages for handled replies and requests	Stefano Brivio	2022-10-27	1	-5/+25
	...instead of just reporting errors. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tap: Trace received (outbound) ICMP packets in debug mode, too	Stefano Brivio	2022-10-27	1	-0/+2
	This only worked for ICMPv6: ICMP packets have no TCP-style header, so they are handled as a special case before packet sequences are formed, and the call to tap_packet_debug() was missing. Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf, passt.1: Don't imply --foreground with --debug	Stefano Brivio	2022-10-27	2	-7/+5
	Having -f implied by -d (and --trace) usually saves some typing, but debug mode in background (with a log file) is quite useful if pasta is started by Podman, and is probably going to be handy for passt with libvirt later, too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/run: Temporarily disable distribution tests (tag: 2022_10_26.e4df8b0)	Stefano Brivio	2022-10-26	1	-0/+4
	They're too slow to cope with current release cycles, and they haven't found bugs in months, also because clang-tidy and cppcheck would find most of them earlier. Disable them for the moment. We should pre-install gcc and make in non-x86 images, as those run on my test machine with qemu TCG, and that's the real slow-down here. Then we can re-enable them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	hooks: Temporarily disable demo generation in pre-push	Stefano Brivio	2022-10-26	1	-4/+11
	The out-of-tree Podman patch needs to be rebased every second week or so, and I'm currently trying to get that upstream: https://github.com/containers/podman/pull/16141 Disable demo generation for the moment, so that I avoid wasting time with those rebases. We'll re-enable it later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Add log file tests for pasta plus corresponding layout and setup	Stefano Brivio	2022-10-26	5	-1/+150
	To test log files on a tmpfs mount, we need to unshare the mount namespace, which means using a context for the passt pane is not really practical at the moment, as we can't open a shell there, so we would have to encapsulate all the commands under 'unshare -rUm', plus the "inner" pasta command, running in turn a tcp_rr server. It might be worth fixing this by e.g. detecting we are trying to spawn an interactive shell and adding a special path in the context setup with some form of stdin redirection -- I'm not sure it's doable though. For this reason, add a new layout, using a context only for the host pane, while keeping the old command dispatch mechanism for the passt pane. We also need a new setup function that doesn't start pasta: we want to start and restart it with different options. Further, we need a 'pint' directive, to send an interrupt to the passt pane: add that in lib/test. All the tests before the one involving tmpfs and a detached mount namespace were also tested with the context mechanism. To make an eventual conversion easier, pass tcp_crr directly as a command on pasta's command line where feasible. While at it, fix the comment to the teardown_pasta() function. The new test set can be semi-conveniently run as: ./run pasta_options/log_to_file and it checks basic log creation, size of the log file after flooding it with debug entries, rotations, and basic consistency after rotations, on both an existing filesystem and a tmpfs, chosen as it doesn't support collapsing data ranges via fallocate(), hence triggering the fall-back mechanism for logging rotation. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Fix calculation for ICMP checksum on IPv4	Stefano Brivio	2022-10-26	1	-2/+5
	We need to zero out the checksum field before calculating the checksum, of course. I have no idea how this passed the "icmp" test set, looking into it. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 67ab6171729c ("Add csum_icmp4() helper for calculating ICMP checksums") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
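As an illustration of why the field has to be cleared first, here is a minimal sketch of the RFC 1071 Internet checksum applied to an ICMP message held in one contiguous buffer. The names inet_csum() and icmp4_set_csum() are hypothetical and are not the helpers from checksum.c; the only assumption taken from the ICMP format is that the checksum occupies bytes 2 and 3 of the message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ICMP_CSUM_OFF	2	/* offset of the checksum field in an ICMP message */

/* RFC 1071 checksum of buf, returned as a host-order number (hypothetical helper) */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)	/* sum big-endian 16-bit words */
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)				/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Store the checksum of an ICMP message, ignoring whatever the field held before */
static void icmp4_set_csum(uint8_t *msg, size_t len)
{
	uint16_t csum;

	memset(msg + ICMP_CSUM_OFF, 0, 2);	/* field must be zero while summing */
	csum = inet_csum(msg, len);
	msg[ICMP_CSUM_OFF] = csum >> 8;		/* store in network byte order */
	msg[ICMP_CSUM_OFF + 1] = csum & 0xff;
}
```

Without the memset(), a stale value in the field would be folded into the sum and the receiver's verification would fail. Running inet_csum() over a message with a correct checksum in place yields 0.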
*	conf: Don't pass leading ~ to parse_port_range() on exclusions (tag: 2022_10_24.c11277b)	Stefano Brivio	2022-10-24	1	-0/+1
	Commit 84fec4e998b6 ("Clean up parsing of port ranges") drops the strspn() call before the parsing of excluded port ranges, because now we're checking against any stray characters at every step. However, that also has the effect of passing ~ as first character to the new parse_port_range(), which makes no sense: we already checked that ~ is the first character before the call, so skip it. Alona reported this output: Invalid port specifier ~15000,~15001,~15006,~15008,~15020,~15021,~15090 while the whole specifier is indeed valid. Reported-by: Alona Paz <alkaplan@redhat.com> Fixes: 84fec4e998b6 ("Clean up parsing of port ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
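The shape of the fix can be sketched as follows. This is not the code from conf.c: parse_range() is a hypothetical stand-in for parse_port_range(), and parse_exclusions() with its byte-per-port array is invented for the example. The point is only that the caller consumes the ~ it has already checked for before handing the rest to the range parser.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse "N" or "N-M" at s, set *endp past it (hypothetical stand-in) */
static int parse_range(const char *s, const char **endp,
		       unsigned *first, unsigned *last)
{
	unsigned long v;
	char *end;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (end == s || errno || v > 65535)
		return -1;
	*first = *last = v;

	if (*end == '-') {
		s = end + 1;
		v = strtoul(s, &end, 10);
		if (end == s || errno || v > 65535 || v < *first)
			return -1;
		*last = v;
	}

	*endp = end;
	return 0;
}

/* Parse "~N[-M],~N[-M],...", flagging excluded ports in a 65536-entry array */
static int parse_exclusions(const char *spec, unsigned char *excluded)
{
	unsigned first, last, i;
	const char *p = spec;

	while (*p) {
		if (*p != '~')
			return -1;
		p++;	/* skip the ~ we just checked, the range parser must not see it */

		if (parse_range(p, &p, &first, &last))
			return -1;
		for (i = first; i <= last; i++)
			excluded[i] = 1;

		if (*p == ',') {
			if (!*++p)	/* trailing comma */
				return -1;
		} else if (*p) {
			return -1;	/* stray character */
		}
	}

	return 0;
}
```

With the `p++` removed, the specifier reported above would be rejected at its first character, which matches the symptom in the report.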
*	util: Set NS_FN_STACK_SIZE to one eighth of ulimit-reported maximum stack size (tag: 2022_10_22.b68da10)	Stefano Brivio	2022-10-22	1	-1/+1
	...instead of one fourth. On the main() -> conf() -> nl_sock_init() call path, LTO from gcc 12 on (at least) x86_64 decides to inline... everything: nl_sock_init() is effectively part of main(), after commit 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()"). This means we exceed the maximum stack size, and we get SIGSEGV, under any condition, at start time, as reported by Andrea on a recent build for CentOS Stream 9. The calculation of NS_FN_STACK_SIZE, which is the stack size we reserve for clones, was previously obtained by dividing the maximum stack size by two, to avoid an explicit check on architecture (on PA-RISC, also known as hppa, the stack grows up, so we point the clone to the middle of this area), and then further divided by two to allow for any additional usage in the caller. Well, if there are essentially no function calls anymore, this is not enough. Divide it by eight, which is anyway much more than possibly needed by any clone()d callee. I think this is robust, so it's a fix in some sense. Strictly speaking, though, we have no formal guarantees that this isn't either too little or too much. What we should do, eventually: check clone()d callees, there are just thirteen of them at the moment. Note down any stack usage (they are mostly small helpers), bonus points for an automated way at build time, quadruple that or so, to allow for extreme clumsiness, and use as NS_FN_STACK_SIZE. Perhaps introduce a specific condition for hppa. Reported-by: Andrea Bolognani <abologna@redhat.com> Fixes: 3e2eb4337bc0 ("conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add git-publish configuration file	Andrea Bolognani	2022-10-22	1	-0/+3
	Signed-off-by: Andrea Bolognani <abologna@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	qrap: Support JSON syntax for -device	Andrea Bolognani	2022-10-21	1	-10/+41
	Starting with version 8.1.0, libvirt uses JSON syntax when generating the arguments to -device, so they will now look like {"driver":"virtio-scsi-pci","bus":"pci.3","addr":"0x0"} instead of virtio-scsi-pci,bus=pci.3,addr=0x0 qrap needs to parse these arguments and extract the bus number in order to figure out what address to use for the virtio-net device it adds, and the libvirt change described above has broken this parsing logic. Tweak the code so that both styles are accepted and handled correctly. Note that, when JSON is in use, qrap needs to generate its own command line options in that format as well or things will not work as expected. Signed-off-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
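To make the two argument styles concrete, here is a rough sketch of pulling the PCI bus number out of either form. It is not qrap's parser: device_pci_bus() is a hypothetical name, and the string matching assumes the compact JSON shown above (no whitespace around the colon) rather than doing real JSON parsing.

```c
#include <stdio.h>
#include <string.h>

/* Return N from bus "pci.N" in a -device argument, or -1 if there is none.
 * Accepts both virtio-scsi-pci,bus=pci.3,... and {"bus":"pci.3",...}. */
static int device_pci_bus(const char *arg)
{
	const char *p;
	int bus;

	if (arg[0] == '{')			/* JSON syntax */
		p = strstr(arg, "\"bus\":\"pci.");
	else					/* legacy key=value syntax */
		p = strstr(arg, "bus=pci.");
	if (!p)
		return -1;

	p = strstr(p, "pci.") + strlen("pci.");	/* always found, it is part of the match */
	if (sscanf(p, "%d", &bus) != 1 || bus < 0)
		return -1;

	return bus;
}
```

A caller could use the leading `{` in the same way to decide which syntax to emit for the device it adds, in line with the note that the generated options must match the style in use.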
*	dhcp: Use tap_udp4_send() helper in dhcp()	David Gibson	2022-10-19	2	-17/+2
	The IPv4 specific dhcp() manually constructs L2 and IP headers to send its DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the tap_udp6_send() helper. Now that we've broadened the parameters to tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap_ip4_send() into UDP and ICMP variants	David Gibson	2022-10-19	3	-21/+66
	tap_ip4_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp4_send() and tap_icmp4_send() functions without greatly increasing the code size, thus removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv4 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Use tap_icmp6_send() helper	David Gibson	2022-10-19	1	-17/+4
	We send ICMPv6 packets to the guest from both icmp.c and from ndp.c. The case in ndp() manually constructs L2 and IPv6 headers, unlike the version in icmp.c which uses the tap_icmp6_send() helper from tap.c. Now that we've broadened the parameters of tap_icmp6_send() we can use it in ndp() as well, saving some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp: Remove unneeded eh_source parameter	David Gibson	2022-10-19	3	-7/+4
	ndp() takes a parameter giving the ethernet source address of the packet it is to respond to, which it uses to determine the destination address to send the reply packet to. This is not necessary, because the address will always be the guest's MAC address. Even if the guest has just changed MAC address, then either tap_handler_passt() or tap_handler_pasta() - which are the only call paths leading to ndp() - will have updated c->mac_guest with the new value. So, remove the parameter, and just use c->mac_guest, making it more consistent with other paths where we construct packets to send inwards. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Split tap_ip6_send() into UDP and ICMP variants	David Gibson	2022-10-19	4	-40/+75
	tap_ip6_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp6_send() and tap_icmp6_send() functions without greatly increasing the code size, thus removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv6 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Split tap_ip_send() into IPv4 and IPv6 specific functions	David Gibson	2022-10-19	4	-96/+103
	The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and it turns out that every caller (statically) knows if it is using IPv4 or IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions. Use a new tap_l2_hdr() function for the very small common part. While we're there, make some minor cleanups:
	- We were double writing some fields in the IPv6 header, so that it temporarily matched the pseudo-header for checksum calculation. With recent checksum reworks, this isn't necessary any more.
	- We don't use any IPv4 header options, so use some sizeof() constructs instead of some open coded values for header length.
	- The comment used to say that the flow label was for TCP over IPv6, but in fact the only thing we used it for was DHCPv6 over UDP traffic
	Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Remove unhelpful vnet_pre optimization from tap_send()	David Gibson	2022-10-19	5	-24/+13
	Callers of tap_send() can optionally use a small optimization by adding extra space for the 4 byte length header used on the qemu socket interface. tap_ip_send() is currently the only user of this, but this is used only for "slow path" ICMP and DHCP packets, so there's not a lot of value to the optimization. Worse, having the two paths here complicates the interface and makes future cleanups difficult, so just remove it. I have some plans to bring back the optimization in a more general way in future, but for now it's just in the way. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove support for TCP packets from tap_ip_send()	David Gibson	2022-10-19	3	-44/+2
	tap_ip_send() is never used for TCP packets, we're unlikely to use it for that in future, and the handling of TCP packets makes other cleanups unnecessarily awkward. Remove it. This is the only user of csum_tcp4(), so we can remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add helpers for normal inbound packet destination addresses	David Gibson	2022-10-19	2	-5/+31
	tap_ip_send() doesn't take a destination address, because it's specifically for inbound packets, and the IP addresses of the guest/namespace are already known to us. Rather than open-coding this destination address logic, make helper functions for it which will enable some later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_ip4_header() helper to calculate IPv4 header checksums	David Gibson	2022-10-19	4	-4/+13
	We calculate IPv4 header checksums in at least two places, in dhcp() and in tap_ip_send(). Add a helper to handle this calculation in both places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_udp4() helper for calculating UDP over IPv4 checksums	David Gibson	2022-10-19	4	-2/+37
	At least two places in passt fill in UDP over IPv4 checksums, although since UDP checksums are optional with IPv4 that just amounts to storing a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in dhcp()). For consistency, add a helper for this "calculation". Just for the heck of it, add the option (compile time disabled for now) to calculate real UDP checksums. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
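For readers unfamiliar with the two modes, the sketch below shows what a "real" UDP over IPv4 checksum involves next to the trivial 0. The function names, the runtime `real` flag (the commit describes a compile-time switch) and the host-order address parameters are all assumptions made for the example; this is not the csum_udp4() signature.

```c
#include <stddef.h>
#include <stdint.h>

/* Folded one's complement sum of the IPv4 pseudo-header plus UDP datagram.
 * src and dst are IPv4 addresses in host byte order (assumption of this sketch);
 * udp points at the UDP header, len is header plus payload length. */
static uint16_t udp4_sum(uint32_t src, uint32_t dst,
			 const uint8_t *udp, size_t len)
{
	/* Pseudo-header: source, destination, protocol 17, UDP length */
	uint32_t sum = (src >> 16) + (src & 0xffff) +
		       (dst >> 16) + (dst & 0xffff) + 17 + (uint32_t)len;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)udp[i] << 8 | udp[i + 1];
	if (len & 1)
		sum += (uint32_t)udp[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/* Checksum to store (as a host-order number); bytes 6-7 of udp must be zero */
static uint16_t udp4_csum(uint32_t src, uint32_t dst,
			  const uint8_t *udp, size_t len, int real)
{
	uint16_t csum;

	if (!real)
		return 0;	/* 0 means "no checksum", valid for UDP over IPv4 */

	csum = (uint16_t)~udp4_sum(src, dst, udp, len);
	return csum ? csum : 0xffff;	/* a computed 0 is sent as all ones */
}
```

A receiver that sums the pseudo-header and the datagram, checksum included, gets 0xffff for an intact packet.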
*	Add csum_udp6() helper for calculating UDP over IPv6 checksums	David Gibson	2022-10-19	3	-3/+28
	Add a helper for calculating UDP checksums when used over IPv6. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the UDP header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_icmp4() helper for calculating ICMP checksums	David Gibson	2022-10-19	3	-3/+19
	Although tap_ip_send() is currently the only place calculating ICMP checksums, create a helper function for symmetry with ICMPv6. For future flexibility it allows the ICMP header and payload to be in separate buffers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Add csum_icmp6() helper for calculating ICMPv6 checksums	David Gibson	2022-10-19	4	-8/+33
	At least two places in passt calculate ICMPv6 checksums, ndp() and tap_ip_send(). Add a helper to handle this calculation in both places. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the ICMPv6 header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt.1: Add David to AUTHORS (tag: 2022_10_15.b3f3591)	Stefano Brivio	2022-10-15	1	-2/+2
	I just realised while reading the man page. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user()	Stefano Brivio	2022-10-15	5	-63/+98
	Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in the target user namespace as we isolate the process, which means we're unable to bind to low ports at that point. Bind inbound ports, and only those, before isolate_user(). Keep the handling of outbound ports (for pasta mode only) after the setup of the namespace, because that's where we'll bind them. To this end, initialise the netlink socket for the init namespace before isolate_user() as well, as we actually need to know the addresses of the upstream interface before binding ports, in case they're not explicitly passed by the user. As we now call nl_sock_init() twice, checking its return code from conf() twice looks a bit heavy: make it exit(), instead, as we can't do much if we don't have netlink sockets. While at it:
	- move the v4_only && v6_only options check just after the first option processing loop, as this is more strictly related to option parsing proper
	- update the man page, explaining that CAP_NET_BIND_SERVICE is not the preferred way to bind ports, because passt and pasta can be abused to allow other processes to make effective usage of it. Add a note about the recommended sysctl instead
	- simplify nl_sock_init_do() now that it's called once for each case
	Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Rename pasta_setup_ns() to pasta_spawn_cmd()	David Gibson	2022-10-15	1	-9/+9
	pasta_setup_ns() no longer has much to do with setting up a namespace. Instead it's really about starting the shell or other command we want to run with pasta connectivity. Rename it and its argument structure to be less misleading. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Only configure UID/GID mappings in userns when spawning shell	David Gibson	2022-10-15	4	-16/+18
	When in passt mode, or pasta mode spawning a command, we create a userns for ourselves. This is used both to isolate the pasta/passt process itself and to run the spawned command, if any. Since eed17a47 "Handle userns isolation and dropping root at the same time" we've handled both cases the same, configuring the UID and GID mappings in the new userns to map whichever UID we're running as to root within the userns. This mapping is desirable when spawning a shell or other command, so that the user gets a root shell with reasonably clear abilities within the userns and netns. It's not necessarily essential, though. When not spawning a shell, it doesn't really have any purpose: passt itself doesn't need to be root and can operate fine with an unmapped user (using some of the capabilities we get when entering the userns instead). Configuring the uid_map can cause problems if passt is running with any capabilities in the initial namespace, such as CAP_NET_BIND_SERVICE to allow it to forward low ports. In this case the kernel makes files in /proc/pid owned by root rather than the starting user to prevent the user from interfering with the operation of the capability-enhanced process. This includes uid_map, meaning we are not able to write to it. Whether this behaviour is correct in the kernel is debatable, but in any case we might as well avoid problems by only initializing the user mappings when we really want them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Prevent any child processes gaining capabilities	David Gibson	2022-10-15	1	-0/+56
	We drop our own capabilities, but it's possible that processes we exec() could gain extra privilege via file capabilities. It shouldn't be possible for us to exec() anyway due to seccomp() and our filesystem isolation. But just in case, zero the bounding and inheritable capability sets to prevent any such child from gaining privilege. Note that we do this after spawning the pasta shell/command (if any), because we do want the user to be able to give that privilege if they want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Replace drop_caps() with a version that actually does something	David Gibson	2022-10-15	3	-11/+92
	The current implementation of drop_caps() doesn't really work because it attempts to drop capabilities from the bounding set. That's not the set that really matters, it's about limiting the abilities of things we might later exec() rather than our own capabilities. It also requires CAP_SETPCAP which we won't usually have. Replace it with a new version which uses capset(2) to drop capabilities from the effective and permitted sets. For now we leave the inheritable set as is, since we don't want to preclude the user from passing inheritable capabilities to the command spawned by pasta. Correctly dropping caps reveals that we were relying on some capabilities we'd supposedly dropped. Re-divide the dropping of capabilities between isolate_initial(), isolate_user() and isolate_prefork() to make this work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Refactor isolate_user() to allow for a common exit path	David Gibson	2022-10-15	1	-24/+16
	Currently, isolate_user() exits early if the --netns-only option is given. That works for now, but shortly we're going to want to add some logic to go at the end of isolate_user() that needs to run in all cases: joining a given userns, creating a new userns, or staying in our original userns (--netns-only). To avoid muddying those changes, here we reorganize isolate_user() to have a common exit path for all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Replace FWRITE with a function	David Gibson	2022-10-15	4	-22/+45
	In a few places we use the FWRITE() macro to open a file, replace its contents with a given string and close it again. There's no real reason this needs to be a macro rather than just a function though. Turn it into a function 'write_file()' and make some ancillary cleanups while we're there:
	- Add a return code so the caller can handle giving a useful error message
	- Handle the case of short write()s (unlikely, but possible)
	- Add O_TRUNC, to make sure we replace the existing contents entirely
	Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	isolation: Clarify various self-isolation steps	David Gibson	2022-10-15	3	-13/+86
	We have a number of steps of self-isolation scattered across our code. Improve function names and add comments to make it clearer what the self isolation model is, what the steps do, and why they happen at the points they happen. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Remove unhelpful drop_caps() call in pasta_start_ns()	David Gibson	2022-10-15	1	-2/+0
	drop_caps() has a number of bugs which mean it doesn't do what you'd expect. However, even if we fixed those, the call in pasta_start_ns() doesn't do anything useful:
	* In the common case, we're UID 0 at this point. In this case drop_caps() doesn't accomplish anything, because even with capabilities dropped, we are still privileged.
	* When attaching to an existing namespace with --userns or --netns-only we might not be UID 0. In this case it's too early to drop all capabilities: we need at least CAP_NET_ADMIN to configure the tap device in the namespace.
	Remove this call - we will still drop capabilities a little later in sandbox(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta_start_ns() always ends in parent context	David Gibson	2022-10-15	1	-4/+1
	The end of pasta_start_ns() has a test against pasta_child_pid, testing if we're in the parent or the child. However we started the child running the pasta_setup_ns function which always exec()s or exit()s, so if we return from the clone() we are always in the parent, making that test unnecessary. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: More general way of starting spawned shell as a login shell	David Gibson	2022-10-15	1	-12/+20
	When invoked so as to spawn a shell, pasta checks explicitly for the shell being bash and if so, adds a "-l" option to make it a login shell. This is not ideal, since this is a bash specific option and requires pasta to know about specific shell variants. There's a general convention for starting a login shell, which is to prepend a "-" to argv[0]. Use this approach instead, so we don't need bash specific logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
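The argv[0] convention mentioned here is shell-independent and easy to show. The helper below is a hypothetical sketch, not the code in pasta.c: it builds the string a caller would pass as argv[0] while still executing the shell by its real path.

```c
#include <stdio.h>
#include <string.h>

/* Build a login-shell argv[0] ("-bash" for "/bin/bash") into out.
 * Returns 0 on success, -1 if out is too small. */
static int login_argv0(const char *shell, char *out, size_t size)
{
	const char *base = strrchr(shell, '/');
	int n;

	base = base ? base + 1 : shell;		/* strip any directory part */
	n = snprintf(out, size, "-%s", base);	/* leading '-' marks a login shell */

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller would then run something along the lines of `execvp(shell, (char *[]){ argv0, NULL })`, where `shell` is the real path and `argv0` the string built above.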
*	test: Move slower tests to end of test run	David Gibson	2022-10-15	1	-10/+10
	The distro and performance tests are by far the slowest part of the passt testsuite. Move them to the end of the testsuite run, so that it's easier to do a quick test during development by letting the other tests run then interrupting the test runner. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log.h: Avoid unnecessary GNU extension for token pasting	Stefano Brivio	2022-10-15	1	-2/+2
	clang says: ./log.h:23:18: warning: token pasting of ',' and __VA_ARGS__ is a GNU extension [-Wgnu-zero-variadic-macro-arguments] We need token pasting here just because of the 'format' in trace(): drop it. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util.h: Add missing gcc pragma push before pragma pop	Stefano Brivio	2022-10-15	1	-0/+1
	While building with clang: ./util.h:176:24: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas] Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	icmp: Set sin6_scope_id for outbound ICMPv6 echo requests	Stefano Brivio	2022-10-15	1	-0/+1
	If we ping a link-local address, we need to pass this to sendto(), as it will obviously fail with -EINVAL otherwise. If we ping other addresses, it's probably a good idea anyway to specify the configured outbound interface here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
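A small sketch of what setting the scope looks like on the destination address handed to sendto(). The helper name icmp6_dest() and the lookup by interface name are assumptions for the example; the commit only states that sin6_scope_id is set to the configured outbound interface.

```c
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>

/* Fill sa for an ICMPv6 echo request to addr, scoped to interface ifname.
 * Returns 0 on success, -1 on a bad address or unknown interface. */
static int icmp6_dest(struct sockaddr_in6 *sa, const char *addr,
		      const char *ifname)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;

	if (inet_pton(AF_INET6, addr, &sa->sin6_addr) != 1)
		return -1;

	/* A link-local destination is ambiguous without an interface index */
	sa->sin6_scope_id = if_nametoindex(ifname);

	return sa->sin6_scope_id ? 0 : -1;
}
```

The resulting structure is what would be passed as the destination argument of sendto() on the ICMPv6 socket.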
*	conf: Drop excess colons in usage for DHCP and DNS options	Stefano Brivio	2022-10-15	1	-4/+4
	Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	netlink: Disable duplicate address detection for configured IPv6 address	Stefano Brivio	2022-10-15	1	-0/+3
	With default options, when we pass --config-net, the IPv6 address is actually going to be recycled from the init namespace, so it is in fact duplicated, but duplicate address detection has no way to find out. With a different configured address, that's not the case, but anyway duplicate address detection will be unable to see this. In both cases, we're wasting time for nothing. Pass the IFA_F_NODAD flag as we configure globally scoped IPv6 addresses via netlink. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	Don't create 'tap' socket for ports that are bound to loopback only	Stefano Brivio	2022-10-15	2	-69/+117
	If the user specifies an explicit loopback address for a port binding, we're going to use that address for the 'tap' socket, and the same exact address for the 'spliced' socket (because those are, by definition, only bound to loopback addresses). This means that the second binding will fail, and, unexpectedly, the port is forwarded, but via tap device, which means the source address in the namespace won't be a loopback address. Make it explicit under which conditions we're creating which kind of socket, by refactoring tcp_sock_init() into two separate functions for IPv4 and IPv6 and gathering those conditions at the beginning. Also, don't create spliced sockets if the user specifies explicitly a non-loopback address, those are harmless but not desired either. Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tcp_splice: Fix port remapping for inbound, spliced connections	Stefano Brivio	2022-10-15	3	-11/+18
	In pasta mode, when we receive a new inbound connection, we need to select a socket that was created in the namespace to proceed and connect() it to its final destination. The existing condition might pick a wrong socket, though, if the destination port is remapped, because we'll check the bitmap of inbound ports using the remapped port (stored in the epoll reference) as index, and not the original port. Instead of using the port bitmap for this purpose, store this information in the epoll reference itself, by adding a new 'outbound' bit, that's set if the listening socket was created in the namespace, and unset otherwise. Then, use this bit to pick a socket on the right side. Suggested-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 33482d5bf293 ("passt: Add PASTA mode, major rework") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp, tcp_splice: Adjust comments to current meaning of inbound and outbound	Stefano Brivio	2022-10-15	2	-2/+2
	For tcp_sock_init_ns(), "inbound" connections used to be the ones being established toward any listening socket we create, as opposed to sockets we connect(). Similarly, tcp_splice_new() used to handle "inbound" connections in the sense that they originated from listening sockets, and they would in turn cause a connect() on an "outbound" socket. Since commit 1128fa03fe73 ("Improve types and names for port forwarding configuration"), though, inbound connections are more broadly defined as the ones directed to guest or namespace, and outbound the ones originating from there. Update comments for those two functions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Fix port and address checks for DNS forwarder	Stefano Brivio	2022-10-15	1	-3/+3
	First off, as we swap endianness for source ports in udp_fill_data_v{4,6}(), we want host endianness, not network endianness. It doesn't actually matter if we use htons() or ntohs() here, but the current version is confusing. In the IPv4 path, when we remap DNS answers, we already swapped the endianness as needed for the source port: don't swap it again, otherwise we'll not map DNS answers for IPv4. In the IPv6 path, when we remap DNS answers, we want to check that they came from our upstream DNS server, not the one configured via --dns-forward (which doesn't even need to exist for this functionality to work). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
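The remark that htons() and ntohs() are interchangeable, while swapping twice is a bug, can be checked with a few lines. The helper from_dns_port() is hypothetical and only illustrates comparing a wire-format source port against 53 in host byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* True if a source port read from the wire (network byte order) is 53.
 * htons() would work just as well as ntohs(): both are either the same
 * 16-bit byte swap or a no-op. ntohs() simply states the intended direction. */
static int from_dns_port(uint16_t sport_net)
{
	return ntohs(sport_net) == 53;
}
```

Since the two calls are the same operation, applying either one to an already converted value undoes the conversion: `htons(htons(53))` is 53 again, so on a little-endian host a comparison against a double-swapped port silently stops matching.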
*	tap: Don't check sequence counts when adding packets to pool	Stefano Brivio	2022-10-15	1	-6/+6
	This is a minor optimisation possibility I spotted while trying to debug a hang in tap4_handler(): if we run out of space for packet sequences, it's fine to add packets to an existing per-sequence pool. We should check the count of packet sequences only once we realise that we actually need a new packet sequence. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	packet: Fix off-by-one in packet_get_do() sanity checks	Stefano Brivio	2022-10-15	1	-1/+1
	An n-sized pool, or a pool with n entries, doesn't include index n, only up to n - 1. I'm not entirely sure this sanity check actually covers any practical case, but I spotted this while debugging a hang in tap4_handler() (possibly due to malformed sequence entries from qemu). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
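The off-by-one is the classic `>` versus `>=` bounds check. The sketch below uses a made-up pool structure, not the one from packet.h, purely to show which comparison rejects index n for a pool holding n entries.

```c
#include <stddef.h>

/* Hypothetical fixed-capacity pool: valid indices are 0 .. count - 1 */
struct pool_sketch {
	size_t count;		/* number of entries currently in the pool */
	int entries[8];
};

/* Return a pointer to entry idx, or NULL if idx is out of range */
static int *pool_get(struct pool_sketch *p, size_t idx)
{
	if (idx >= p->count)	/* 'idx > p->count' would wrongly accept idx == count */
		return NULL;

	return &p->entries[idx];
}
```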
*	conf: Report usage for --no-netns-quit	Stefano Brivio	2022-10-15	1	-0/+2
	Fixes: 745a9ba4284c ("pasta: By default, quit if filesystem-bound net namespace goes away") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>