passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	tcp: Don't use TCP_WINDOW_CLAMP	David Gibson	2023-11-10	2	-60/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On the L2 tap side, we see TCP headers and know the TCP window that the ultimate receiver is advertising. In order to avoid unnecessary buffering within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt to advertise that window back to the original sock-side sender using TCP_WINDOW_CLAMP. However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply had no effect on established sockets. After that commit, it does affect established sockets but doesn't behave the way we need: * It appears to be designed only to shrink the window, not to allow it to re-expand. * More importantly, that commit has a serious bug where if the setsockopt() is made when the existing kernel advertised window for the socket happens to be zero, it will now become locked at zero, stopping any further data from being received on the socket. Since this has never worked as intended, simply remove it. It might be possible to re-implement the intended behaviour by manipulating SO_RCVBUF, so we leave a comment to that effect. This kernel bug is the underlying cause of both the linked passt bug and the linked podman bug. We attempted to fix this before with passt commit d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag"). However while that commit masked the bug for some cases, it didn't really address the problem. Fixes: d3192f67c492 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag") Link: https://github.com/containers/podman/issues/20170 Link: https://bugs.passt.top/show_bug.cgi?id=74 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Rename and small cleanup to tcp_clamp_window()	David Gibson	2023-11-10	1	-11/+10
\| \| \| \| \| \| \| \| \| \| \| \|	tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the sock side advertised window, but it is also responsible for actually updating the conn->wnd_from_tap value. Rename to tcp_tap_window_update() to reflect that broader purpose, and pull the logic that's not TCP_WINDOW_CLAMP related out to the front. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/lib/perf_report: Fix up table highlight for pasta's local flows	Stefano Brivio	2023-11-10	1	-1/+9
\| \| \| \| \| \| \| \| \|	As commit 29269705239f ("test/perf: Small MTUs for spliced TCP aren't interesting") drops all columns for TCP test MTUs except for one, in throughput test for pasta's local flows, the first column we need to highlight in that table is now the second one. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Revert "selinux: Drop user_namespace class rules for Fedora 37"2023_11_07.56d9f6d	Stefano Brivio	2023-11-07	2	-0/+4
\| \| \| \| \| \| \| \|	This reverts commit 3fb3f0f7a59498bdea1d199eecfdbae6c608f78f: it was meant as a patch for Fedora 37 (and no later versions), not something I should have merged upstream. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Allow passt to talk over unconfined_t UNIX domain socket for --fd2023_11_07.74e6f48	Stefano Brivio	2023-11-07	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If passt is started with --fd to talk over a pre-opened UNIX domain socket, we don't really know what label might be associated to it, but at least for an unconfined_t socket, this bit of policy wouldn't belong to anywhere else: enable that here. This is rather loose, of course, but on the other hand passt will sandbox itself into an empty filesystem, so we're not really adding much to the attack surface except for what --fd is supposed to do. Reported-by: Matej Hrica <mhrica@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2247221 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Match implicit va_start() with va_end() in vlogmsg()	Stefano Brivio	2023-11-07	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	According to C99, 7.15.1: Each invocation of the va_start and va_copy macros shall be matched by a corresponding invocation of the va_end macro in the same function and the same applies to C11. I still have to come across a platform where va_end() actually does something, but thus spake the standard. This would be reported by Coverity as "Missing varargs init or cleanup" (CWE-573). Fixes: c0426ff10bc9 ("log: Add vlogmsg()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Don't try to read bound ports from invalid file handles	Stefano Brivio	2023-11-07	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a minimal fix for what would be reported by Coverity as "Improper use of negative value" (CWE-394): port_fwd_init() doesn't guarantee that all the pre-opened file handles are actually valid. We should probably warn on failing open() and open_in_ns() in port_fwd_init(), too, but that's outside the scope of this minimal fix. Before commit 5a0485425bc9 ("port_fwd: Pre-open /proc/net/* files rather than on-demand"), we used to have a single open() call and a check after it. Fixes: 5a0485425bc9 ("port_fwd: Pre-open /proc/net/* files rather than on-demand") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	netlink: Sequence numbers are actually 32 bits wide	Stefano Brivio	2023-11-07	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Harmless, as we use sequence numbers monotonically anyway, but now clang-tidy reports: /home/sbrivio/passt/netlink.c:155:7: error: format specifies type 'unsigned short' but the argument has type '__u32' (aka 'unsigned int') [clang-diagnostic-format,-warnings-as-errors] nh->nlmsg_seq, seq); ^ /home/sbrivio/passt/log.h:26:7: note: expanded from macro 'die' err(__VA_ARGS__); \ ^~~~~~~~~~~ /home/sbrivio/passt/log.h:19:34: note: expanded from macro 'err' ^~~~~~~~~~~ Suppressed 222820 warnings (222816 in non-user code, 4 NOLINT). Use -header-filter=.* to display errors from all non-system headers. Use -system-headers to display errors from system headers as well. 1 warning treated as error make: *** [Makefile:255: clang-tidy] Error 1 Fixes: 9d4ab98d538f ("netlink: Add nl_do() helper for simple operations with error checking") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Simplify calculation of "omit" time for TCP throughput	David Gibson	2023-11-07	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For the TCP throughput tests, we use iperf3's -O "omit" option which ignores results for the given time at the beginning of the test. Currently we calculate this as 1/6th of the test measurement time. The purpose of -O, however, is to skip over the TCP slow start period, which in no way depends on the overall length of the test. The slow start time is roughly speaking log_2 ( max_window_size / MSS ) * round_trip_time These factors all vary between tests and machines we're running on, but we can estimate some reasonable bounds for them: * The maximum window size is bounded by the buffer sizes at each end, which shouldn't exceed 16MiB * The mss varies with the MTU we use, but the smallest we use in tests is ~256 bytes * Round trip time will vary with the system, but with these essentially local transfers it will typically be well under 1ms (on my laptop it is closer to 0.03ms) That gives a worst case slow start time of about 16ms. Setting an omit time of 0.1s uniformly is therefore more than enough, and substantially smaller than what we calculate now for the default case (10s / 6 ~= 1.7s). This reduces total time for the standard benchmark run by around 30s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Remove unnecessary --pacing-timer options	David Gibson	2023-11-07	2	-3/+3
\| \| \| \| \| \| \| \| \| \|	We always set --pacing-timer when invoking iperf3. However, the iperf3 man page implies this is only relevant for the -b option. We only use the -b option for the UDP tests, not TCP, so remove --pacing-timer from the TCP cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: "MTU" changes in passt_tcp host to guest aren't useful	David Gibson	2023-11-07	1	-29/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TCP packet size used on the passt L2 link (qemu socket) makes a huge difference to passt/pasta throughput; many of passt's overheads (chiefly syscalls) are per-packet. That packet size is largely determined by the MTU on the L2 link, so we benchmark for a number of different MTUs. That works well for the guest to host transfers. For the host to guest transfers, we purport to test for different MTUs, but we're not actually adjusting anything interesting. The host to guest transfers adjust the MTU on the "host's" (actually ns) loopback interface. However, that only affects the packet size for the socket going to passt, not the packet size for the L2 link that passt manages - passt can and will repack the stream into packets of its own size. Since the depacketization on that socket is handled by the kernel it doesn't have a lot of bearing on passt's performance. We can't fix this by changing the L2 link MTU from the guest side (as we do for guest to host), because that would only change the guest's view of the MTU, passt would still think it has the large MTU. We could test this by using the --mtu option to passt, but that would require restarting passt for each run, which is awkward in the current setup. So, for now, drop all the "small MTU" tests for host to guest. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Explicitly control UDP packet length, instead of MTU	David Gibson	2023-11-07	2	-94/+75
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Packet size can make a big difference to UDP throughput, so it makes sense to measure it for a variety of different sizes. Currently we do this by adjusting the MTU on the relevant interface before running iperf3. However, the UDP packet size has no inherent connection to the MTU - it's controlled by the sender, and the MTU just affects whether the packet will make it through or be fragmented. The only reason adjusting the MTU works is because iperf3 bases its default packet size on the (path) MTU. We can test this more simply by using the -l option to the iperf3 client to directly control the packet size, instead of adjusting the MTU. As well as simplifying this lets us test different packet sizes for host to ns traffic. We couldn't do that previously because we don't have permission to change the MTU on the host. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Small MTUs for spliced TCP aren't interesting	David Gibson	2023-11-07	1	-52/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we make TCP throughput measurements for spliced connections with a number of different MTU values. However, the results from this aren't really interesting. Unlike with tap connections, spliced connections only involve the loopback interface on host and container, not a "real" external interface. lo typically has an MTU of 65535 and there is very little reason to ever change that. So, the measurements for smaller MTUs are rarely going to be relevant. In addition, the fact that we can offload all the {de,}packetization to the kernel with splice(2) means that the throughput difference between these MTUs isn't very great anyway. Remove the short MTUs and only show spliced throughput for the normal 65535 byte loopback MTU. This reduces runtime of the performance tests on my laptop by about 1 minute (out of ~24 minutes). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Start iperf3 server less often	David Gibson	2023-11-07	5	-109/+213
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we start both the iperf3 server(s) and client(s) afresh each time we want to make a bandwidth measurement. That's not really necessary as usually a whole batch of bandwidth measurements can use the same server. Split up the iperf3 directive into 3 directives: iperf3s to start the server, iperf3 to make a measurement and iperf3k to kill the server, so that we can start the server less often. This - and more importantly, the reduced number of waits for the server to be ready - reduces runtime of the performance tests on my laptop by about 4m (out of ~28minutes). For now we still restart the server between IPv4 and IPv6 tests. That's because in some cases the latency measurements we make in between use the same ports. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Get iperf3 stats from client side	David Gibson	2023-11-07	2	-19/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	iperf3 generates statistics about its run on both the client and server sides. They don't have exactly the same information, but both have the pieces we need (AFAICT the server communicates some nformation to the client over the control socket, so the most important information is in the client side output, even if measured by the server). Currently we use the server side information for our measurements. Using the client side information has several advantages though: * We can directly wait for the client to complete and we know we'll have the output we want. We don't need to sleep to give the server time to write out the results. * That in turn means we can wrap up as soon as the client is done, we don't need to wait overlong to make sure everything is finished. * The slightly different organisation of the data in the client output means that we always want the same json value, rather than requiring slightly different onces for UDP and TCP. The fact that we avoid some extra delays speeds up the overal run of the perf tests by around 7 minutes (out of around 35 minutes) on my laptop. The fact that we no longer unconditionally kill client and server after a certain time means that the client could run indefinitely if the server doesn't respond. We mitigate that by setting 1s connect timeout on the client. This isn't foolproof - if we get an initial response, but then lose connectivity this could still run indefinitely, however it does cover by far the most likely failure cases. --snd-timeout would provide more robustness, but I've hit odd failures when trying to use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf: Remove stale iperf3c/iperf3s directives	David Gibson	2023-11-07	2	-6/+1
\| \| \| \| \| \| \| \| \| \|	Some older revisions used separate iperf3c and iperf3s test directives to invoke the iperf3 client and server. Those were combined into a single iperf3 directive some time ago, but a couple of places still have the old syntax. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Remove socket from udp_{tap,splice}_map when timed out	David Gibson	2023-11-07	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We save sockets bound to particular ports in udp_{tap,splice}_map for reuse later. If they're not used for a time, we time them out and close them. However, when that happened, we weren't actually removing the fds from the relevant map. That meant that later interactions on the same port could get a stale fd from the map. The stale fd might be closed, leading to unexpected EBADF errors, or it could have been re-used by a completely different socket bound to a different port, which could lead to us incorrectly forwarding packets. Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Reported-by: Jay <bugs.passt.top@bitsbetwixt.com> Link: https://bugs.passt.top/show_bug.cgi?id=57 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Consistently use -1 to indicate un-opened sockets in maps	David Gibson	2023-11-07	3	-5/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp uses the udp_tap_map, udp_splice_ns and udp_splice_init tables to keep track of already opened sockets bound to specific ports. We need a way to indicate entries where a socket hasn't been opened, but the code isn't consistent if this is indicated by a 0 or a -1: * udp_splice_sendfrom() and udp_tap_handler() assume that 0 indicates an unopened socket * udp_sock_init() fills in -1 for a failure to open a socket * udp_timer_one() is somewhere in between, treating only strictly positive fds as valid -1 (or, at least, negative) is really the correct choice here, since 0 is a theoretically valid fd value (if very unlikely in practice). Change to use that consistently throughout. The table does need to be initialised to all -1 values before any calls to udp_sock_init() which can happen from conf_ports(). Because C doesn't make it easy to statically initialise non zero values in large tables, this does require a somewhat awkward call to initialise the table from conf(). This is the best approach I could see for the short term, with any luck it will go away at some point when those socket tables are replaced by a unified flow table. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Add vlogmsg()	David Gibson	2023-11-07	2	-9/+17
\| \| \| \| \| \| \| \| \| \| \|	Currently logmsg() is only available as a variadic function. This is fine for normal use, but is awkward if we ever want to write wrappers around it which (for example) add standardised prefix information. To allow that, add a vlogmsg() function which takes a va_list instead, and implement logmsg() in terms of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Enable format warnings	David Gibson	2023-11-07	6	-12/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logmsg() takes printf like arguments, but because it's not a built in, the compiler won't generate warnings if the format string and parameters don't match. Enable those by using the format attribute. Strictly speaking this is a gcc extension, but I believe it is also supported by some other common compilers. We already use some other attributes in various places. For now, just use it and we can worry about compilers that don't support it if it comes up. This exposes some warnings from existing callers, both in gcc and in clang-tidy: - Some are straight out bugs, which we correct - It's occasionally useful to invoke the logging functions with an empty string, which gcc objects to, so disable that specific warning in the Makefile - Strictly speaking the C standard requires that the parameter for a %p be a (void *), not some other pointer type. That's only likely to cause problems in practice on weird architectures with different sized representations for pointers to different types. Nonetheless add the casts to make it happy. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log: Don't define logging function 4 times	David Gibson	2023-11-07	2	-39/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In log.c we use a macro to define logging functions for each of 4 priority levels. The only difference between these is the priority we pass to vsyslog() and similar functions. Because it's done as a macro, however, the entire functions code is included in the binary 4 times. Rearrange this to take the priority level as a parameter to a regular function, then just use macros to define trivial wrappers which pass the priority level. This saves about 600 bytes of text in the executable (x86, non-AVX2). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove remaining declaration of tcp_l2_mh	Laurent Vivier	2023-11-07	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \|	Use of tcp_l2_mh has been removed in commit 38fbfdbcb95d, but its declaration and initialization are always in the code. Remove them as they are useless. Fixes: 38fbfdbcb95d ("tcp: Get rid of iov with cached MSS, drop sendmmsg(), add deferred flush") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Simplify selection of socket and pipe sides in socket handler	David Gibson	2023-11-07	1	-59/+22
\| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_sock_handler() uses the tcp_splice_dir() helper to select which of the socket, pipe and counter fields to use depending on which side of the connection the socket event is coming from. Now that we are using arrays for the two sides, rather than separate named fields, we can instead just use a variable indicating the side and use that to index the arrays whever we need a particular side's field. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_destroy()	David Gibson	2023-11-07	1	-18/+14
\| \| \| \| \| \| \| \| \|	tcp_splice_destroy() has some close-to-duplicated logic handling closing of the socket and pipes for each side of the connection. We can use a loop across the sides to reduce the duplication. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_connect_finish()	David Gibson	2023-11-07	1	-40/+25
\| \| \| \| \| \| \| \| \|	tcp_splice_connect_finish() has two very similar blocks opening the two pipes for each direction of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Exploit side symmetry in tcp_splice_timer()	David Gibson	2023-11-07	1	-16/+11
\| \| \| \| \| \| \| \| \|	tcp_splice_timer() has two very similar blocks one after another that handle the SO_RCVLOWAT flags for the two sides of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Rename sides of connection from a/b to 0/1	David Gibson	2023-11-07	2	-139/+130
\| \| \| \| \| \| \| \| \| \| \| \| \|	Each spliced connection has two mostly, although not entirely, symmetric sides. We currently call those "a" and "b" and have different fields in the connection structure for each one. We can better exploit that symmetry if we use two element arrays rather thatn separately named fields. Do that in the places we can, and for the others change the "a"/"b" terminology to 0/1 to match. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Don't pool pipes in pairs	David Gibson	2023-11-07	1	-29/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To reduce latencies, the tcp splice code maintains a pool of pre-opened pipes to use for new connections. This is structured as an array of pairs of pipes, with each pipe, of course, being a pair of fds. Thus when we use the pool, a single pool "slot" provides both the a->b and b->a pipes. There's no strong reason to store the pool in pairs, though - we can with not much difficulty instead take the a->b and b->a pipes for a new connection independently from separate slots in the pool, or even take one from the the pool and create the other as we need it, if there's only one pipe left in the pool. This marginally increases the length of code, but simplifies the structure of the pipe pool. We should be able to re-shrink the code with later changes, too. In the process we also fix some minor bugs: - If we both failed to find a pipe in the pool and to create a new one, we didn't log an error and would silently drop the connection. That could make debugging such a situation difficult. Add in an error message for that case - When refilling the pool, if we were only able to open a single pipe in the pair, we attempted to rollback, but instead of closing the opened pipe, we instead closed the pipe we failed to open (probably leading to some ignored EBADFD errors). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-13/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We initialise the events_a and events_b variables with tcp_splice_conn_epoll_events() function, then immediately copy the values into ev_a.events and ev_b.events. We can't simply pass &ev_[ab].events to tcp_splice_conn_epoll_events(), because struct epoll_event is packed, leading to 'pointer may be unaligned' warnings if we attempt that. We can, however, make tcp_splice_conn_epoll_events() take struct epoll_event pointers rather than raw u32 pointers, avoiding the awkward temporaries. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Remove unnecessary forward declaration	David Gibson	2023-11-07	1	-37/+34
\| \| \| \| \| \| \| \| \|	In tcp_splice.c we forward declare tcp_splice_epoll_ctl() then define it later on. However, there are no circular dependencies which prevent us from simply having the full definition in place of the forward declaration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_epoll_ctl() removes both sockets from the epoll set if called when conn->flags & CLOSING. This will always happen immediately after setting that flag, since conn_flag_do() makes the call itself. That's also the _only_ time it can happen: we perform the EPOLL_CTL_DEL without clearing the conn->in_epoll flag, meaning that any further calls to tcp_splice_epoll_ctl() would attempt EPOLL_CTL_MOD, which would necessarily fail since the fds are no longer in the epoll. The EPOLL_CTL_DEL path in tcp_splice_epoll_ctl() has essentially zero overlap with anything else the function does, so just move them to be open coded in conn_flag_do(). This does require kernel 2.6.9 or later, in order to pass NULL as the event structure for epoll_ctl(). However, we already require at least 3.13 to allow unprivileged user namespaces. Given that, simply directly perform the EPOLL_CTL_DEL operations from conn_flag_do() rather than unnecessarily multiplexini Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Correct error handling in tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If we get an error from epoll_ctl() in tcp_splice_epoll_ctl() we goto the 'delete' path where we remove both sockets from the epoll set and return an error. There are several problems with this: - We 'return -errno' after the EPOLL_CTL_DEL operations, which means the deleting epoll_ctl() calls may have overwritten the errno values which actually triggered the failures. - The call from conn_flag_do() occurs when the CLOSING flag is set, in which case we go do the delete path regardless of error. In that case the 'return errno' is meaningless since we don't expect the EPOLL_CTL_DEL operations to fail and we ignore the return code anyway. - All other calls to tcp_splice_epoll_ctl() check the return code and if non-zero immediately call conn_flag(..., CLOSING) which will call tcp_splice_epoll_ctl() again explicitly to remove the sockets from epoll. That means removing them when the error first occurs is redundant. - We never specifically report an error on the epoll_ctl() operations. We just set the connection to CLOSING, more or less silently killing it. This could make debugging difficult in the unlikely even that we get a failure here. Re-organise tcp_splice_epoll_ctl() to just log a message then return in the error case, and only EPOLL_CTL_DEL when explicitly asked to with the CLOSING flag. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp_splice: Remove redundant tcp_splice_epoll_ctl()	David Gibson	2023-11-07	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice_conn_update() calls tcp_splice_epoll_ctl() twice: first ignoring the return value, then checking it. This serves no purpose. If the first call succeeds, the second call will do exactly the same thing again, since nothing has changed in conn. If the first call fails, then tcp_splice_epoll_ctl() itself will EPOLL_CTL_DEL both fds, meaning when the second call tries to EPOLL_CTL_MOD them it will necessarily fail. It appears that this duplication was introduced by accident in an otherwise unrelated patch. Fixes: bb708111 ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Pass originating pif to tap handler functions	David Gibson	2023-11-07	7	-17/+34
\| \| \| \| \| \| \| \| \| \| \|	For now, packets passed to the various *_tap_handler() functions always come from the single "tap" interface. We want to allow the possibility to broaden that in future. As preparation for that, have the code in tap.c pass the pif id of the originating interface to each of those handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Record originating pif in listening socket refs	David Gibson	2023-11-07	6	-23/+28
\| \| \| \| \| \| \| \| \| \|	For certain socket types, we record in the epoll ref whether they're sockets in the namespace, or on the host. We now have the notion of "pif" to indicate what "place" a socket is associated with, so generalise the simple one-bit 'ns' to a pif id. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pif: Introduce notion of passt/pasta interface	David Gibson	2023-11-07	2	-1/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several possible ways of communicating with other entities. We use sockets to communicate with the host and other network sites, but also in a different context to communicate "spliced" channels to a namespace. We also use a tuntap device or a qemu socket to communicate with the namespace or guest. For the time being these are just defined implicitly by how we structure things. However, there are other communication channels we want to use in future (e.g. virtio-user), and we want to allow more flexible forwarding between those. To accomplish that we're going to want a specific way of referring to those channels. Introduce the concept of a "passt/pasta interface" or "pif" representing a specific channel to communicate network data. Each pif is assumed to be associated with a specific network namespace in the broad sense (that is as a place where IP addresses have a consistent meaning - not the Linux specific sense). But there could be multiple pifs communicating with the same namespace (e.g. the spliced and tap interfaces in pasta). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Clean up ref initialisation in udp_sock_init()	David Gibson	2023-11-07	1	-5/+2
\| \| \| \| \| \| \| \| \|	udp_sock_init() has a number of paths that initialise uref differently. However some of the fields are initialised the same way in all of them. Move those fields into the original initialiser to save a few lines. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Simplify get_bound_ports_() to port_fwd_scan_()	David Gibson	2023-11-07	3	-38/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	get_bound_ports_() now only use their context and ns parameters to determine which forwarding maps they're operating on. Each function needs the map they're actually updating, as well as the map for the other direction, to avoid creating forwarding loops. The UDP function also requires the corresponding TCP map, to implement the behaviour where we forward UDP ports of the same number as bound TCP ports for tools like iperf3. Passing those maps directly as parameters simplifies the code without making the callers life harder, because those already know the relevant maps. IMO, invoking these functions in terms of where they're looking for updated forwarding also makes more logical sense than in terms of where they're looking for bound ports. Given that new way of looking at the functions, also rename them to port_fwd_scan_(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Move port scanning /proc fds into struct port_fwd	David Gibson	2023-11-07	3	-35/+36
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we store /proc/net fds used to implement automatic port forwarding in the proc_net_{tcp,udp} fields of the main context structure. However, in fact each of those is associated with a particular direction of forwarding, and we already have struct port_fwd which collects all other information related to a particular direction of port forwarding. We can simplify things a bit by moving the /proc fds into struct port_fwd. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Split TCP and UDP cases for get_bound_ports()	David Gibson	2023-11-07	3	-36/+47
\| \| \| \| \| \| \| \| \| \| \| \|	Currently get_bound_ports() takes a parameter to determine if it scans for UDP or TCP bound ports, but in fact there's almost nothing in common between those two paths. The parameter appears primarily to have been a convenience for when we needed to invoke this function via NS_CALL(). Now that we don't need that, split it into separate TCP and UDP versions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Don't NS_CALL get_bound_ports()	David Gibson	2023-11-07	2	-71/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we want to scan for bound ports in the namespace we use NS_CALL() to run get_bound_ports() in the namespace. However, the only thing it actually needed to be in the namespace for was to open the /proc/net file it was scanning. Since we now always pre-open those, we no longer need to switch to the namespace for the actual get_bound_ports() calls. That in turn means that tcp_port_detect() doesn't need to run in the ns either, and we can just replace it with inline calls to get_bound_ports(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Pre-open /proc/net/* files rather than on-demand	David Gibson	2023-11-07	1	-18/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	procfs_scan_listen() can either use an already opened fd for a /proc/net file, or it will open it. So, effectively it will open the file on the first call, then re-use the fd in subsequent calls. However, it's not possible to open the /proc/net files after we isolate our filesystem in isolate_prefork(). That means that for each /proc/net file we must call procfs_scan_listen() at least once before isolate_prefork(), or it won't work afterwards. That happens to be the case, but it's a pretty fragile requirement. To make this more robust, instead always pre-open the /proc files we will need in get_bounds_port_init() and have procfs_scan_listen() just use those. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add open_in_ns() helper	David Gibson	2023-11-07	2	-0/+54
\| \| \| \| \| \| \| \| \| \|	Most of our helpers which need to enter the pasta network namespace are quite specialised. Add one which is rather general - it just open()s a given file in the namespace context and returns the fd back to the main namespace. This will have some future uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Better parameterise procfs_scan_listen()	David Gibson	2023-11-07	1	-31/+24
\| \| \| \| \| \| \| \| \| \| \| \| \|	procfs_scan_listen() does some slightly clunky logic to deduce the fd it wants to use, the path it wants to open and the state it's looking for based on parameters for protocol, IP version and whether we're in the namespace. However, the caller already has to make choices based on similar parameters so it can just pass in the things that procfs_scan_listen() needs directly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Move automatic port forwarding code to port_fwd.[ch]	David Gibson	2023-11-07	8	-153/+185
\| \| \| \| \| \| \| \| \| \| \| \| \|	The implementation of scanning /proc files to do automatic port forwarding is a bit awkwardly split between procfs_scan_listen() in util.c, get_bound_ports() and related functions in conf.c and the initial setup code in conf(). Consolidate all of this into port_fwd.h, which already has some related definitions, and a new port_fwd.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Cleaner initialisation of default forwarding modes	David Gibson	2023-11-07	1	-33/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Initialisation of the forwarding mode variables is complicated a bit by the fact that we have different defaults for passt and pasta modes. This leads to some debateably duplicated code between those two cases. More significantly, however, currently the final setting of the mode variable is interleaved with the code to actually start automatic scanning when that's selected. This essentially mingles "policy" code (setting the default mode), with implementation code (making that happen). That's a bit inflexible if we want to change policies in future. Disentangle these two pieces, and use a slightly different construction to make things briefer as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	selinux: Drop user_namespace class rules for Fedora 37	Stefano Brivio	2023-11-07	2	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	With current selinux-policy-37.22-1.fc37.noarch, and presumably any future update for Fedora 37, the user_namespace class is not available, so statements using it prevent the policy from being loaded. If a class is not defined in the base policy, any related permission is assumed to be enabled, so we can safely drop those. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2237996 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	dhcp: put option 53 at the beginning2023_10_04.f851084	Stas Sergeev	2023-10-04	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	... unless it is listed in 55. Many clients expect option 53 at the beginning. mTCP has this code: if ( resp->options[0] != 53 ) { TRACE_WARN(( "Dhcp: first option was not a Dhcp msg type\n" )); return; } wattcp32 has this: static int DHCP_is_ack (void) { const BYTE opt = (const BYTE) &dhcp_in.dh_opt[4]; return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_ACK); } static int DHCP_is_nack (void) { const BYTE opt = (const BYTE) &dhcp_in.dh_opt[4]; return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_NAK); } Link: https://bugs.passt.top/show_bug.cgi?id=77 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> [sbrivio: s/options 53/option 53/ and s/other/others/ in comment] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, tap: Don't increase tap-side sequence counter for dropped frames	Stefano Brivio	2023-10-04	3	-10/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptors we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag	Stefano Brivio	2023-10-04	1	-5/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the sender, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the sender. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway). Reported-by: Matej Hrica <mhrica@redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=74 Analysed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>