passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	conf: Add command line switch to enable IP_FREEBIND socket option	David Gibson	2024-10-04	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a couple of recent reports, we've seen that it can be useful for pasta to forward ports from addresses which are not currently configured on the host, but might be in future. That can be done with the sysctl net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in the first place. We can allow the same thing on a per-socket basis with the IP_FREEBIND (or IPV6_FREEBIND) socket option. Add a --freebind command line argument to enable this socket option on all listening sockets. Link: https://bugs.passt.top/show_bug.cgi?id=101 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, pif: Replace sock_l4() with pif_sock_l4()	David Gibson	2024-09-25	1	-52/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The sock_l4() function is very convenient for creating sockets bound to a given address, but its interface has some problems. Most importantly, the address and port alone aren't enough in some cases. For link-local addresses (at least) we also need the pif in order to properly construct a socket adddress. This case doesn't yet arise, but it might cause us trouble in future. Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it should attempt to create a "dual stack" socket which will respond to both IPv4 and IPv6 traffic. This only makes sense if there is no specific address given. We verify this at runtime, but it would be nicer if we could enforce it structurally. For sockets associated specifically with a single flow we already replaced sock_l4() with flowside_sock_l4() which avoids those problems. Now, replace all the remaining users with a new pif_sock_l4() which also takes an explicit pif. The new function takes the address as an inany *, with NULL indicating the dual stack case. This does add some complexity in some of the callers, however future planned cleanups should make this go away again. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Remove possible quadratic behaviour from write_remainder()	David Gibson	2024-09-18	1	-10/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	write_remainder() steps through the buffers in an IO vector writing out everything past a certain byte offset. However, on each iteration it rescans the buffer from the beginning to find out where we're up to. With an unfortunate set of write sizes this could lead to quadratic behaviour. In an even less likely set of circumstances (total vector length > maximum size_t) the 'skip' variable could overflow. This is one factor in a longstanding Coverity error we've seen (although I still can't figure out the remainder of its complaint). Rework write_remainder() to always work out our new position in the vector relative to our old/current position, rather than starting from the beginning each time. As a bonus this seems to fix the Coverity error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add helper to write() all of a buffer	David Gibson	2024-09-18	1	-0/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	write(2) might not write all the data it is given. Add a write_all_buf() helper to keep calling it until all the given data is written, or we get an error. Currently we use write_remainder() to do this operation in pcap_frame(). That's a little awkward since it requires constructing an iovec, and future changes we want to make to write_remainder() will be easier in terms of this single buffer helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Fix order of operands and carry of one second in timespec_diff_us()	Stefano Brivio	2024-09-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Don't stop on unrelated values when looking for --fd in close_open_files()	Stefano Brivio	2024-08-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seen with krun: we get a file descriptor via --fd, but we close it and happily use the same number for TCP files. The issue is that if we also get other options before --fd, with arguments, getopt_long() stops parsing them because it sees them as non-option values. Use the - modifier at the beginning of optstring (before :, which is needed to avoid printing errors) instead of +, which means we'll continue parsing after finding unrelated option values, but getopt_long() won't reorder them anyway: they'll be passed with option value '1', which we can ignore. By the way, we also need to add : after F in the optstring, so that we're able to parse the option when given as short name as well. Now that we change the parsing mode between close_open_files() and conf(), we need to reset optind to 0, not to 1, whenever we call getopt_long() again in conf(), so that the internal initialisation of getopt_long() evaluating GNU extensions is re-triggered. Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828 Fixes: baccfb95ce0e ("conf: Stop parsing options at first non-option argument") Fixes: 09603cab28f9 ("passt, util: Close any open file that the parent might have leaked") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Correct sock_l4() binding for link local addresses	David Gibson	2024-08-21	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the host interface's link local address, the actual purpose of it is to provide a link local address for passt's private use on the tap interface. Instead set the scope id for any link-local address we're binding to. We're going to need something and this is what makes sense for sockets on the host. It doesn't make sense for PIF_SPLICE sockets, but those should always have loopback, not link-local addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Helper for formatting MAC addresses	David Gibson	2024-08-21	1	-0/+19
\| \| \| \| \| \| \| \| \|	There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf: Stop parsing options at first non-option argument	Stefano Brivio	2024-08-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given that pasta supports specifying a command to be executed on the command line, even without the usual -- separator as long as there's no ambiguity, we shouldn't eat up options that are not meant for us. Paul reports, for instance, that with: pasta --config-net ip -6 route -6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'. That's because getopt_long(), by default, shuffles the argument list to shift non-option arguments at the end. Avoid that by adding '+' at the beginning of 'optstring'. Reported-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	passt, util: Close any open file that the parent might have leaked	Stefano Brivio	2024-08-08	1	-0/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any. This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul. Use close_range(2) to close all open files except for standard streams and the one from --fd. Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple. As we need to check that the passed --fd option doesn't exceed INT_MAX, because we'll parse it with strtol() but file descriptor indices are signed ints (regardless of the arguments close_range() take), extend the existing check in the actual --fd parsing in conf(), also rejecting file descriptors numbers that match standard streams, while at it. Suggested-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Paul Holzinger <pholzing@redhat.com>
*	util: Some corrections for timespec_diff_us	David Gibson	2024-08-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The comment for timespec_diff_us() claims it will wrap after 2^64µs. This is incorrect for two reasons: * It returns a long long, which is probably 64-bits, but might not be * It returns a signed value, so even if it is 64 bits it will wrap after 2^63µs Correct the comment and use an explicitly 64-bit type to avoid that imprecision. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use unsigned (size_t) value for iov length	David Gibson	2024-08-06	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	The "correct" type for the length of an IOV is unclear: writev() and readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the unsigned size_t has some advantages, though, and it makes more sense for the case of write_remainder. Using size_t throughout here means we don't have a signed vs. unsigned comparison, and we don't have to deal with the case of iov_skip_bytes() returning a value which becomes negative when assigned to an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log, util: Fix sub-second part in relative log time calculation	Stefano Brivio	2024-07-26	1	-7/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part. Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating the difference from the starting time. Do this in a macro as we need the same format in a few places. To calculate the difference, turn the existing timespec_diff_ms() to microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to use that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Rename UDP listening sockets	David Gibson	2024-07-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle "spliced" datagrams with per-flow sockets	David Gibson	2024-07-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	flow: Helper to create sockets based on flowside	David Gibson	2024-07-19	1	-3/+3
\| \| \| \| \| \| \| \| \|	We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle errors on UDP sockets	David Gibson	2024-07-17	1	-0/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we ignore all events other than EPOLLIN on UDP sockets. This means that if we ever receive an EPOLLERR event, we'll enter an infinite loop on epoll, because we'll never do anything to clear the error. Luckily that doesn't seem to have happened in practice, but it's certainly fragile. Furthermore changes in how we handle UDP sockets with the flow table mean we will start receiving error events. Add handling of EPOLLERR events. For now we just read the error from the error queue (thereby clearing the error state) and print a debug message. We can add more substantial handling of specific events in future if we want to. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add AF_UNSPEC support to sockaddr_ntop()	David Gibson	2024-07-17	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \|	Allow sockaddr_ntop() to format AF_UNSPEC socket addresses. There do exist a few cases where we might legitimately have either an AF_UNSPEC or a real address, such as the origin address from MSG_ERRQUEUE. Even in cases where we shouldn't get an AF_UNSPEC address, formatting it is likely to make things easier to debug if we ever somehow do. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: sock_l4() determine protocol from epoll type rather than the reverse	David Gibson	2024-07-05	1	-22/+26
\| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates a socket of the given IP protocol number, and adds it to the epoll state. Currently it determines the correct tag for the epoll data based on the protocol. However, we have some future cases where we might want different semantics, and therefore epoll types, for sockets of the same protocol. So, change sock_l4() to take the epoll type as an explicit parameter, and determine the protocol from that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Replace strerror() calls	Stefano Brivio	2024-06-21	1	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have logging functions embedding perror() functionality, we can make _some_ calls more terse by using them. In many places, the strerror() calls are still more convenient because, for example, they are used in flow debugging functions, or because the return code variable of interest is not 'errno'. While at it, convert a few error messages from a scant perror style to proper failure descriptions. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Split construction of bind socket address from the rest of sock_l4()	David Gibson	2024-06-14	1	-53/+70
\| \| \| \| \| \| \| \| \| \| \| \| \|	sock_l4() creates, binds and otherwise prepares a new socket. It builds the socket address to bind from separately provided address and port. However, we have use cases coming up where it's more natural to construct the socket address in the caller. Prepare for this by adding sock_l4_sa() which takes a pre-constructed socket address, and rewriting sock_l4() in terms of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use 'long' to represent millisecond durations	David Gibson	2024-06-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	timespec_diff_ms() returns an int representing a duration in milliseconds. This will overflow in about 25 days when an int is 32 bits. The way we use this function, we're probably not going to get a result that long, but it's not outrageously implausible. Use a long for safety. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Use unsigned indices for bits in bitmaps	David Gibson	2024-06-07	1	-4/+4
\| \| \| \| \| \| \| \| \|	A negative bit index in a bitmap doesn't make sense. Avoid this by construction by using unsigned indices. While we're there adjust bitmap_isset() to return a bool instead of an int. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, util: Move opening of PID file to its own function	Stefano Brivio	2024-05-23	1	-0/+22
\| \| \| \| \| \| \|	We won't call it from main() any longer: move it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	util: Rename write_pidfile() to pidfile_write()	Stefano Brivio	2024-05-23	1	-3/+3
\| \| \| \| \| \| \| \|	As I'm adding pidfile_open() in the next patch. The function comment didn't match, by the way. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
*	util, tcp: Add helper to display socket addresses	David Gibson	2024-05-22	1	-0/+56
\| \| \| \| \| \| \| \| \| \| \|	When reporting errors, we sometimes want to show a relevant socket address. Doing so by extracting the various relevant fields can be pretty awkward, so introduce a sockaddr_ntop() helper to make it simpler. For now we just have one user in tcp.c, but I have further upcoming patches which can make use of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: fix confusion between offset in the iovec array and in the entry2024_03_20.71dd405	Laurent Vivier	2024-03-20	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In write_remainder() 'skip' is the offset to start the operation from in the iovec array. In iov_skip_bytes(), 'skip' is also the offset in the iovec array but 'offset' is the first unskipped byte in the iovec entry. As write_remainder() uses 'skip' for both, 'skip' is reset to the first unskipped byte in the iovec entry rather to staying the first unskipped byte in the iovec array. Fix the problem by introducing a new variable not to overwrite 'skip' on each loop. Fixes: 8bdb0883b441 ("util: Add write_remainder() helper") Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	icmp: Use 'flowside' epoll references for ping sockets	David Gibson	2024-03-12	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently ping sockets use a custom epoll reference type which includes the ICMP id. However, now that we have entries in the flow table for ping flows, finding that is sufficient to get everything else we want, including the id. Therefore remove the icmp_epoll_ref type and use the general 'flowside' field for ping sockets. Having done this we no longer need separate EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 reference types, because we can easily determine which case we have from the flow type. Merge both types into EPOLL_TYPE_PING. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: move IP stuff from util.[ch] to ip.[ch]	Laurent Vivier	2024-03-06	1	-55/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Introduce ip.[ch] file to encapsulate IP protocol handling functions and structures. Modify various files to include the new header ip.h when it's needed. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-ID: <20240303135114.1023026-5-lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add write_remainder() helper	David Gibson	2024-02-29	1	-0/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several places where we want to write(2) a buffer or buffers and we handle short write()s by retrying until everything is successfully written. Add a helper for this in util.c. This version has some differences from the typical write_all() function. First, take an IO vector rather than a single buffer, because that will be useful for some of our cases. Second, allow it to take an parameter to skip the first n bytes of the given buffers. This will be useful for some of the cases we want, and also falls out quite naturally from the implementation. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Minor formatting fixes in write_remainder()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Use sa_family_t for address family variables	David Gibson	2024-02-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Sometimes we use sa_family_t for variables and parameters containing a socket address family, other times we use a plain int. Since sa_family_t is what's actually used in struct sockaddr and friends, standardise on that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Make a bunch of pointer variables pointers to const	David Gibson	2024-01-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another warning for pointer variables which could be pointer to const but aren't. Use this to make a bunch of variables const pointers where they previously weren't for no particular reason. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make sock_l4() treat empty string ifname like NULL	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	sock_l4() takes NULL for ifname if you don't want to bind the socket to a particular interface. However, for a number of the callers, it's more natural to use an empty string for that case. Change sock_l4() to accept either NULL or an empty string equivalently, and simplify some callers using that change. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Avoid in_addr_t	David Gibson	2023-12-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The former is just a type alias to a 32-bit integer, so doesn't really give us any type checking. Therefore we generally prefer the structure, since we mostly want to treat IP address as opaque objects. Fix a few places where we still use in_addr_t, but can just as easily use struct in_addr. Note there are still some uses of in_addr_t in conf.c, but those are justified: since they're doing prefix calculations, they actually need to look at the internals of the address as an integer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Drop explicit setting to INADDR_ANY/in6addr_any in sock_l4()	David Gibson	2023-12-27	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The original commit message says: --- Currently we initialise the address field of the sockaddrs we construct to the any/unspecified address, but not in a very clear way: we use explicit 0 values, which is only interpretable if you know the order of fields in the sockaddr structures. Use explicit field names, and explicit initialiser macros for the address. Because we initialise to this default value, we don't need to explicitly set the any/unspecified address later on if the caller didn't pass an overriding bind address. --- and the original patch modified the initialisation of addr4 and addr6: - instead of { 0 }, { 0 } for sin_addr and sin_zero, .sin_addr = IN4ADDR_ANY_INIT - instead of 0, IN6ADDR_ANY_INIT, 0: .sin6_addr = IN6ADDR_ANY_INIT but I dropped those hunks: they break gcc versions 7 to 9 as reported in eed6933e6c29 ("udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}"). I applied the rest of the changes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Dropped first two hunks] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports	Stefano Brivio	2023-11-22	1	-0/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When pasta periodically scans bound ports and binds them on the other side in order to forward traffic, we bind UDP ports for corresponding TCP port numbers, too, to support protocols and applications such as iperf3 which use UDP port numbers matching the ones used by the TCP data connection. If we scan UDP ports in order to bind UDP ports, we skip detection of the UDP ports we already bound ourselves, to avoid looping back our own ports. Same with scanning and binding TCP ports. But if we scan for TCP ports in order to bind UDP ports, we need to skip bound TCP ports too, otherwise, as David pointed out: - we find a bound TCP port on side A, and bind the corresponding TCP and UDP ports on side B - at the next periodic scan, we find that UDP port bound on side B, and we bind the corresponding UDP port on side A - at this point, we unbind that UDP port on side B: we would otherwise loop back our own port. To fix this, we need to avoid binding UDP ports that we already bound, on the other side, as a consequence of finding a corresponding bound TCP port. Reproducing this issue is straightforward: ./pasta -- iperf3 -s # Wait one second, then from another terminal: iperf3 -c ::1 -u Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 457ff122e33c ("udp,pasta: Periodically scan for ports to automatically forward") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Add open_in_ns() helper	David Gibson	2023-11-07	1	-0/+53
\| \| \| \| \| \| \| \| \| \|	Most of our helpers which need to enter the pasta network namespace are quite specialised. Add one which is rather general - it just open()s a given file in the namespace context and returns the fd back to the main namespace. This will have some future uses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	port_fwd: Move automatic port forwarding code to port_fwd.[ch]	David Gibson	2023-11-07	1	-64/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	The implementation of scanning /proc files to do automatic port forwarding is a bit awkwardly split between procfs_scan_listen() in util.c, get_bound_ports() and related functions in conf.c and the initial setup code in conf(). Consolidate all of this into port_fwd.h, which already has some related definitions, and a new port_fwd.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	cppcheck: Make many pointers const	David Gibson	2023-10-04	1	-2/+3
\| \| \| \| \| \| \| \| \|	Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Avoid shadowing index(3)	David Gibson	2023-09-27	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A classic gotcha of the standard C library is that its unwise to call any variable 'index' because it will shadow the standard string library function index(3). This can cause warnings from cppcheck amongst others, and it also means that if the variable is removed you tend to get confusing type errors (or sometimes nothing at all) instead of a nice simple "name is not defined" error. Strictly speaking this only occurs if <string.h> is included, but that is so common that as a rule it's best to just avoid it always. We have a number of places which hit this trap, so rename variables and parameters to avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Split handling of listening TCP sockets into their own handler	David Gibson	2023-08-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_handler() handles both listening TCP sockets, and connected TCP sockets, but what it needs to do in those cases has essentially nothing in common. Therefore, give listening sockets their own epoll_type value and dispatch directly to their own handler from the top level. Furthermore, the two handlers need essentially entirely different information from the reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate the port for the listening socket, but that's not the same meaning. So, switch listening sockets to their own reference type which we can lay out as we please. That lets us remove the listen and outbound fields from the normal (connected) tcp_epoll_ref, reducing it to just the connection table index. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	epoll: Generalize epoll_ref to cover things other than sockets	David Gibson	2023-08-13	1	-7/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The epoll_ref type includes fields for the IP protocol of a socket, and the socket fd. However, we already have a few things in the epoll which aren't protocol sockets, and we may have more in future. Rename these fields to an abstract "fd type" and file descriptor for more generality. Similarly, rather than using existing IP protocol numbers for the type, introduce our own number space. For now these just correspond to the supported protocols, but we'll expand on that in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Make ns_enter() a void function and report setns() errors	David Gibson	2023-08-04	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ns_enter() returns an integer... but it's always zero. If we actually fail the function doesn't return. Therefore it makes more sense for this to be a function returning void, and we can remove the cases where we pointlessly checked its return value. In addition ns_enter() is usually called from an ephemeral thread created by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't be reported (since NS_CALL() doesn't wait() for the thread). So, use die() instead to print out some information in the unlikely event that our setns() here does fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Use C11 anonymous members to make poll refs less verbose to use	David Gibson	2023-08-04	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	union epoll_ref has a deeply nested set of structs and unions to let us subdivide it into the various different fields we want. This means that referencing elements can involve an awkward long string of intermediate fields. Using C11 anonymous structs and unions lets us do this less clumsily. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util, conf: Add and use ns_is_init() helper	Stefano Brivio	2023-05-23	1	-0/+25
\| \| \| \| \| \| \| \| \| \|	We'll need this in isolate_initial(). While at it, don't rely on BUFSIZ: the earlier issue we had with musl reminded me it's not a magic "everything will fit" value. Size the read buffer to what we actually need from uid_map, and check for the final newline too, because uid_map is organised in lines. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Relicense to GPL 2.0, or any later version	Stefano Brivio	2023-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Pass socket creation errors all the way up	Stefano Brivio	2023-03-09	1	-13/+18
\| \| \| \| \| \| \| \| \|	...starting from sock_l4(), pass negative error (errno) codes instead of -1. They will only be used in two commits from now, no functional changes intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	treewide: Fix header includes to build with musl	Chris Kuhn	2023-03-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	util: Add own prototype for __clone2() on ia642023_02_27.c538ee8	Stefano Brivio	2023-02-27	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	ia64 needs to use __clone2() as clone() is not available, but glibc doesn't export the prototype. Take it from clone(2) to avoid an implicit declaration: util.c: In function ‘do_clone’: util.c:512:16: warning: implicit declaration of function ‘__clone2’ [-Wimplicit-function-declaration] 512 \| return __clone2(fn, stack_area + stack_size / 2, stack_size / 2, \| ^~~~~~~~ Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Make assertions actually useful	David Gibson	2023-02-12	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>