passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	make conf_ports() exit immediately after logging error	Laine Stump	2023-02-16	1	-24/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rather than having conf_ports() (possibly) log an error, and then letting the caller log the entire usage() message and exit, just log the errors and exit immediately (using die()). For some errors, conf_ports would previously not log any specific message, leaving it up to the user to determine the problem by guessing. We replace all of those silent returns with die() (logging a specific error), thus permitting us to make conf_ports() return void, which simplifies the caller. While modifying the two callers to conf_ports() to not check for a return value, we can further simplify the code by removing the check for a non-null optarg, as that is guaranteed to never happen (due to prior calls to getopt_long() with "argument required" for all relevant options - getopt_long() would have already caught this error). Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	eliminate most calls to usage() in conf()	Laine Stump	2023-02-16	1	-214/+122
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Nearly all of the calls to usage() in conf() occur immediately after logging a more detailed error message, and the fact that these errors are occuring indicates that the user has already seen the passt usage message (or read the manpage). Spamming the logfile with the complete contents of the usage message serves only to obscure the more detailed error message. The only time when the full usage message should be output is if the user explicitly asks for it with -h (or its synonyms) As a start to eliminating the excessive calls to usage(), this patch replaces most calls to err() followed by usage() with a call to die() instead. A few other usage() calls remain, but their removal involves bit more nuance that should be properly explained in separate commit messages. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	add die() to log an error message and exit with a single call	Laine Stump	2023-02-16	2	-5/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	Almost all occurences of err() are either immediately followed by exit(EXIT_FAILURE), usage(argv[0]) (which itself then calls exit(EXIT_FAILURE), or that is what's done immediately after returning from the function that calls err(). Modify the errfn macro so that its instantiations can include exit(EXIT_FAILURE) at the end, and use that to create a new function die() that will log an error and then exit. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	log to stderr until process is daemonized, even if a log file is set	Laine Stump	2023-02-16	2	-9/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Once a log file (specified on the commandline) is opened, the logging functions will stop sending error logs to stderr, which is annoying if passt has been started by another process that wants to collect error messages from stderr so it can report why passt failed to start. It would be much nicer if passt continued sending all log messages to stderr until it daemonizes itself (at which point the process that started passt can assume that it was started successfully). The system log mask is set to LOG_EMERG when the process starts, and we're already using that to do "special" logging during the period from process start until the log level requested on the commandline is processed (setting the log mask to something else). This period almost matches with "the time before the process is daemonized"; if we just delay setting the log mask a tiny bit, then it will match exactly, and we can use it to determine if we need to send log messages to stderr even when a log file has been specified and opened. This patch delays the setting of the log mask until immediately before the call to __daemon(). It also modifies logfn() slightly, so that it will log to stderr any time log mask is LOG_EMERG, even if a log file has been opened. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Fedora 32-35 have moved to the archives	David Gibson	2023-02-14	1	-4/+4
\| \| \| \| \| \| \| \|	Fedora 32-35 are now old enough that they're not on all mirrors. Fetch them from the archive server instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test: Update location for Debian ppc64 images	David Gibson	2023-02-14	2	-2/+5
\| \| \| \| \| \| \| \|	The current debian cloud images no longer include ppc64. Change to using the latest snapshot which does include ppc64. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Improve handling of fallback if socket pool is empty on new splice	David Gibson	2023-02-14	3	-60/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When creating a new spliced connection, we need to get a socket in the other ns from the originating one. To avoid excessive ns switches we usually get these from a pool refilled on a timer. However, if the pool runs out we need a fallback. Currently that's done by passing -1 as the socket to tcp_splice_connnect() and running it in the target ns. This means that tcp_splice_connect() itself needs to have different cases depending on whether it's given an existing socket or not, which is a separate concern from what it's mostly doing. We change it to require a suitable open socket to be passed in, and ensuring in the caller that we have one. This requires adding the fallback paths to the caller, tcp_splice_new(). We use slightly different approaches for a socket in the init ns versus the guest ns. This also means that we no longer need to run tcp_splice_connect() itself in the guest ns, which allows us to remove a bunch of boilerplate code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split pool lookup from creating new sockets in tcp_conn_new_sock()	David Gibson	2023-02-14	3	-31/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if that's empty creates a new socket in the init namespace. Both parts of this are duplicated in other places: the pool lookup logic is duplicated in tcp_splice_new(), and the socket opening logic is duplicated in tcp_sock_refill_pool(). Split the function into separate parts so we can remove both these duplications. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Move socket pool declarations around	David Gibson	2023-02-14	4	-48/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_splice.c has some explicit extern declarations to access the socket pools. This is pretty dangerous - if we changed the type of these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the same memory in different ways with no compiler error. So, move the extern declarations to tcp_conn.h so they're visible to both tcp.c and tcp_splice.c, but not the rest of pasta. In fact the pools for the guest namespace are necessarily only used by tcp_splice.c - we have no sockets on the guest side if we're not splicing. So move those declarations and the functions that deal exclusively with them to tcp_splice.c Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Split init and ns cases for tcp_sock_refill()	David Gibson	2023-02-14	1	-32/+21
\| \| \| \| \| \| \| \| \| \|	With the creation of the tcp_sock_refill_pool() helper, the ns==true and ns==false paths for tcp_sock_refill() now have almost nothing in common. Split the two versions into tcp_sock_refill_init() and tcp_sock_refill_ns() functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Make a helper to refill each socket pool	David Gibson	2023-02-14	1	-30/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_sock_refill() contains two near-identical loops to refill the IPv4 and IPv6 socket pools. In addition, if we get an error on the IPv4 pool we exit early and won't attempt to refill the IPv6 pool. At least theoretically, these are independent from each other and there's value to filling up either pool without the other. So, there's no strong reason to give up on one because the other failed. Address both of these with a helper function 'tcp_sock_refill_pool()' to refill a single given pool. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Makefile: Explict int type in FALLOC_FL_COLLAPSE_RANGE probe	Florian Weimer	2023-02-14	1	-1/+1
\| \| \| \| \| \| \| \|	Future compilers will not support implicit ints by default, causing the probe to always fail. Link: https://bugs.passt.top/show_bug.cgi?id=42 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/pasta_options: Ignore failures on shell 'exit'	Stefano Brivio	2023-02-13	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On shell 'exit' commands, running shells from pasta, we might get: Cannot set tty process group (No such process) as some TTY devices might be unaccessible. This is harmless, but after commit "pasta: propagate exit code from child command", we'll get test failures there, at least with dash. Ignore those explicitly with a ugly workaround: we can't simply do something like: exit \|\| : because the failure is reported by the shell itself once it exits, regardless of the command evaluation. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: propagate exit code from child command	Paul Holzinger	2023-02-12	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Exits codes are very useful for scripts, when the pasta child execvp() call fails with ENOENT that parent should also exit with > 0. In short the parent should always exit with the code from the child to make it useful in scripts. It is easy to test with: `pasta -- bash -c "exit 3"; echo $?` Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: correctly exit when execvp() fails	Paul Holzinger	2023-02-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	By default clone() will create a child that does not send SIGCHLD when the child exits. The caller has to specifiy the SIGNAL it should get in the flag bitmask. see clone(2) under "The child termination signal" This fixes the problem where pasta would not exit when the execvp() call failed, i.e. when the command does not exists. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: do not leak netlink sock into child	Paul Holzinger	2023-02-12	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When spawning a child command with pasta command... pasta should not leak fds that it opened. Only the fds that were already open should be given to the child. Run `pasta --config-net -- ls -l /proc/self/fd` from a terminal where only stdin/out/err are open. The fd 3 was opend by ls to read the /proc/self/fd dir. But fd 5 is the netlink socket that was opend in pasta. To prevent such a leak we will open the socket with SOCK_CLOEXEC. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	Make assertions actually useful	David Gibson	2023-02-12	7	-20/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer	Stefano Brivio	2023-02-12	1	-14/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	David reports that TCP transfers might stall, especially with smaller socket buffer sizes, because we reset the ACK_FROM_TAP_DUE flag, in tcp_tap_handler(), whenever we receive an ACK segment, regardless of its sequence number and the fact that we might still be waiting for one. This way, we might fail to re-transmit frames on ACK timeouts. We need, instead, to: - indicate with the @retrans field only re-transmissions for the same data sequences. If we make progress, it should be reset, given that it's used to abort a connection when we exceed a given number of re-transmissions for the same data - unset the ACK_FROM_TAP_DUE flag if and only if the acknowledged sequence is the same as the last one we sent, as suggested by David - keep it set otherwise, if progress was done but not all the data we sent was acknowledged, and update the expiration of the ACK timeout Add a new helper for these purposes, tcp_update_seqack_from_tap(). To extend the ACK timeout, the new helper sets the ACK_FROM_TAP_DUE flag, even if it was already set, and conn_flag_do() triggers a timer update. This part should be revisited at a later time, because, strictly speaking, ACK_FROM_TAP_DUE isn't a flag anymore. One possibility might be to introduce another connection attribute for events affecting timer deadlines. Reported-by: David Gibson <david@gibson.dropbear.id.au> Link: https://bugs.passt.top/show_bug.cgi?id=41 Suggested-by: David Gibson <david@gibson.dropbear.id.au> Fixes: be5bbb9b0681 ("tcp: Rework timers to use timerfd instead of periodic bitmap scan") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Send frames after the first one in tap_send_frames_pasta()	Stefano Brivio	2023-02-12	1	-1/+2
\| \| \| \| \| \| \|	...instead of repeatedly sending out the first one in iov. Fixes: e21ee41ac35a ("tcp: Combine two parts of pasta tap send path together") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Wait for tap to be set up before spawning command	Stefano Brivio	2023-02-12	3	-1/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adapted from a patch by Paul Holzinger: when pasta spawns a command, operating without a pre-existing user and network namespace, it needs to wait for the tap device to be configured and its handler ready, before the command is actually executed. Otherwise, something like: pasta --config-net nslookup passt.top usually fails as the nslookup command is issued before the network interface is ready. We can't adopt a simpler approach based on SIGSTOP and SIGCONT here: the child runs in a separate PID namespace, so it can't send SIGSTOP to itself as the kernel sees the child as init process and blocks the delivery of the signal. We could send SIGSTOP from the parent, but this wouldn't avoid the possible condition where the child isn't ready to wait for it when the parent sends it, also raised by Paul -- and SIGSTOP can't be blocked, so it can never be pending. Use SIGUSR1 instead: mask it before clone(), so that the child starts with it blocked, and can safely wait for it. Once the parent is ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the child is waiting for it, the kernel will queue it for us, because it's blocked. Reported-by: Paul Holzinger <pholzing@redhat.com> Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use tap_send_frames()	David Gibson	2023-01-23	4	-166/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To send frames on the tap interface, the UDP uses a fairly complicated two level batching. First multiple frames are gathered into a single "message" for the qemu stream socket, then multiple messages are send with sendmmsg(). We now have tap_send_frames() which already deals with sending a number of frames, including batching and handling partial sends. Use that to considerably simplify things. This does make a couple of behavioural changes: * We used to split messages to keep them under 32kiB (except when a single frame was longer than that). The comments claim this is needed to stop qemu from closing the connection, but we don't have any equivalent logic for TCP. I wasn't able to reproduce the problem with this series, although it was apparently easy to reproduce earlier. My suspicion is that there was never an inherent need to keep messages small, however with larger messages (and default kernel buffer sizes) the chances of needing more than one resend for partial send()s is greatly increased. We used not to correctly handle that case of multiple resends, but now we do. * Previously when we got a partial send on UDP, we would resend the remainder of the entire "message", including multiple frames. The common code now only resends the remainder of a single frame, simply dropping any frames which weren't even partially sent. This is what TCP always did and is probably a better idea for UDP too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Improve handling of partial frame sends	David Gibson	2023-01-23	1	-13/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In passt mode, when writing frames to the qemu socket, we might get a short send. If we ignored this and carried on, the qemu socket would get out of sync, because the bytes we actually sent wouldn't correspond to the length header we already sent. tap_send_frames_passt() handles that by doing a a blocking send to complete the message, but it has a few flaws: * We only attempt to resend once: although it's unlikely in practice, nothing prevents the blocking send() from also being short * We print a debug error if send() returns non-zero.. but send() returns the number of bytes sent, so we actually want it to return the length of the remaining data. Correct those flaws and also be a bit more thorough about reporting problems here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use abstracted tap header	David Gibson	2023-01-23	2	-62/+24
\| \| \| \| \| \| \| \| \|	Update the UDP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Use different io vector bases depending on tap type	David Gibson	2023-01-23	2	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	Currently tap_send_frames() expects the frames it is given to include the vnet_len field, even in pasta mode which doesn't use it (although it need not be initialized in that case). To match, tap_iov_base() and tap_iov_len() construct the frame in that way. This will inconvenience future changes, so alter things to set the buffers to include just the frame needed by the tap backend type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Use abstracted tap header	David Gibson	2023-01-23	1	-55/+30
\| \| \| \| \| \| \| \| \|	Update the TCP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Add "tap headers" abstraction	David Gibson	2023-01-23	2	-0/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently both the TCP and UDP code need to deal in various places with the details of the L2 headers, and also the tap-specific "vnet_len" header. This makes abstracting the tap interface to new backends (e.g. vhost-user or tun) more difficult. To improve this abstraction, create a new 'tap_hdr' structure which represents both L2 (always Ethernet at the moment, but might be vary in future) and any additional tap specific headers (such as the qemu socket's vnet_len field). Provide helper functions and macros to initialize, update and use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Consolidate calculation of total frame size	David Gibson	2023-01-23	1	-19/+16
\| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_l2_buf_fill_headers() returns the size of the generated frame including the ethernet header. The caller then adds on the size of the vnet_len field to get the total frame size to be passed to the tap device. Outside the tap code, though, we never care about the ethernet header size only the final total size we need to put into an iovec. So, consolidate the total frame size calculation within tcp_l2_buf_fill_headers(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Remove redundant and incorrect initialization from *_iov_init()	David Gibson	2023-01-23	1	-6/+2
\| \| \| \| \| \| \| \| \| \|	tcp_sock[46]_iov_init() initialize the length of each iovec buffer to MSS_DEFAULT. That will always be overwritten before use in tcp_data_to_tap, so it's redundant. It also wasn't correct, because it didn't correctly account for the header lengths in all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Parameterize ethernet header initializer macro	David Gibson	2023-01-23	3	-15/+8
\| \| \| \| \| \| \| \| \|	We have separate IPv4 and IPv6 versions of a macro to construct an initializer for ethernet headers. However, now that we have htons_constant it's easy to simply paramterize this with the ethernet protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Use named field initializers in iov_init functions	David Gibson	2023-01-23	2	-23/+16
\| \| \| \| \| \| \| \| \| \|	Both the TCP and UDP iov_init functions have some large structure literals defined in "field order" style. These are pretty hard to read since it's not obvious what value corresponds to what field. Use named field style initializers instead to make this clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	util: Introduce hton*_constant() in place of #ifdefs	David Gibson	2023-01-23	2	-63/+24
\| \| \| \| \| \| \| \| \| \| \|	We have several places where we have fairly ugly #ifdefs on __BYTE_ORDER where we need network order values in a constant expression (so we can't use htons() or htonl()). We can do this more cleanly by using a single __BYTE_ORDER ifdef to define htons_constant() and htonl_constant() macros, then using those in all the other places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap, tcp: Move tap send path to tap.c	David Gibson	2023-01-23	3	-81/+89
\| \| \| \| \| \| \| \| \| \|	The functions which do the final steps of sending TCP packets on through the tap interface - tcp_l2_buf_flush*() - no longer have anything that's actually specific to TCP in them, other than comments and names. Move them all to tap.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Combine two parts of pasta tap send path together	David Gibson	2023-01-23	1	-22/+18
\| \| \| \| \| \| \| \| \| \| \|	tcp_l2_buf_flush() open codes the loop across each frame in a group, but but calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap device. Combine these two pasta-specific operations into tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Improve interface to tcp_l2_buf_flush()	David Gibson	2023-01-23	1	-37/+26
\| \| \| \| \| \| \| \| \| \|	Currently this takes a msghdr, but the only thing we actually care about in there is the io vector. Make it take an io vector directly. We also have a weird side effect of zeroing @buf_used. Just pass this by value and zero it in the caller instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Don't compute total bytes in a message until we need it	David Gibson	2023-01-23	1	-35/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp[46]_l2_buf_bytes keep track of the total number of bytes we have queued to send to the tap interface. tcp_l2_buf_flush_passt() uses this to determine if sendmsg() has sent all the data we requested, or whether we need to resend a trailing portion. However, the logic for finding where we're up to in the case of a short sendmsg() can equally well tell whether we've had one at all, without knowing the total number in advance. This does require an extra loop after each sendmsg(), but it's doing simple arithmetic on values we've already been accessing, and it leads to overall simpler code. tcp[46]_l2_flags_buf_bytes were being calculated, but never used for anything, so simply remove them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp: Combine two parts of passt tap send path together	David Gibson	2023-01-23	1	-8/+12
\| \| \| \| \| \| \| \| \| \| \|	tcp_l2_buf_flush() open codes the "primary" send of message to the passt tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a short send. Combine these two passt-specific operations into tcp_l2_buf_flush_passt() which is a little cleaner and will enable furrther cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: Replace pcapm() with pcap_multiple()	David Gibson	2023-01-23	3	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pcapm() captures multiple frames from a msghdr, however the only thing it cares about in the msghdr is the list of buffers, where it assumes there is one frame to capture per buffer. That's what we want for its single caller but it's not the only obvious choice here (one frame per msghdr would arguably make more sense in isolation). In addition pcapm() has logic that only makes sense in the context of the passt specific path its called from: it skips the first 4 bytes of each buffer, because those have the qemu vnet_len rather than the frame proper. Make this clearer by replacing pcapm() with pcap_multiple() which more explicitly takes one struct iovec per frame, and parameterizes how much of each buffer to skip (i.e. the offset of the frame within the buffer). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pcap: Introduce pcap_frame() helper	David Gibson	2023-01-23	1	-38/+38
\| \| \| \| \| \| \| \| \| \| \| \|	pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing to the capture file. The purpose of pcapm() and pcapmm() not calling pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid printing errors for every packet in a batch if there's a problem. We can accomplish that while still sharing code by adding a new helper which takes the packet timestamp as a parameter. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't use separate sockets to listen for spliced packets	David Gibson	2023-01-13	1	-40/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, when ports are forwarded inbound in pasta mode, we open two sockets for incoming traffic: one listens on the public IP address and will forward packets to the tuntap interface. The other listens on localhost and forwards via "splicing" (resending directly via sockets in the ns). Now that we've improved the logic about whether we "splice" any individual packet, we don't need this. Instead we can have a single socket bound to 0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal with each packet as appropriate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Decide whether to "splice" per datagram rather than per socket	David Gibson	2023-01-13	2	-20/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we have special sockets for receiving datagrams from locahost which can use the optimized "splice" path rather than going across the tap interface. We want to loosen this so that sockets can receive sockets that will be forwarded by both the spliced and non-spliced paths. To do this, we alter the meaning of the @splice bit in the reference to mean that packets receieved on this socket can be spliced, not that they will be spliced. They'll only actually be spliced if they come from 127.0.0.1 or ::1. We can't (for now) remove the splice bit entirely, unlike with TCP. Our gateway mapping means that if the ns initiates communication to the gw address, we'll translate that to target 127.0.0.1 on the host side. Reply packets will therefore have source address 127.0.0.1 when received on the host, but these need to go via the tap path where that will be translated back to the gateway address. We need the @splice bit to distinguish that case from packets going from localhost to a port mapped explicitly with -u which should be spliced. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Unify udp_sock_handler_splice() with udp_sock_handler()	David Gibson	2023-01-13	1	-60/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one. This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Pre-populate msg_names with local address	David Gibson	2023-01-13	2	-22/+25
\| \| \| \| \| \| \| \| \| \| \|	udp_splice_namebuf is now used only for spliced sending, and so it is only ever populated with the localhost address, either IPv4 or IPv6. So, replace the awkward initialization in udp_sock_handler_splice() with statically initialized versions for IPv4 and IPv6. We then just need to update the port number in udp_sock_handler_splice(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Don't handle tap receive batch size calculation within a #define	David Gibson	2023-01-13	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a batch for sizing our buffers and control structures. The subtly different UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to receive at once for tap packets in the current configuration. This depends on the mode, meaning that the macro has a non-obvious dependency on the usual 'c' context variable being available. We only use it in one place, so it makes more sense to open code this. Add an explanatory comment while we're there. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split receive from preparation and send in udp_sock_handler()	David Gibson	2023-01-13	1	-27/+52
\| \| \| \| \| \| \| \| \| \| \|	The receive part of udp_sock_handler() and udp_sock_handler_splice() is now almost identical. In preparation for merging that, split the receive part of udp_sock_handler() from the part preparing and sending the frames for sending on the tap interface. The latter goes into a new udp_tap_send() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split sending to passt tap interface into separate function	David Gibson	2023-01-13	1	-58/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The last part of udp_sock_handler() does the actual sending of frames to the tap interface. For pasta that's just a call to udp_tap_send_pasta() but for passt, it's moderately complex and open coded. For symmetry, move the passt send path into its own function, udp_tap_send_passt(). This will make it easier to abstract the tap interface in future (e.g. when we want to add vhost-user). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Move sending pasta tap frames to the end of udp_sock_handler()	David Gibson	2023-01-13	1	-19/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	udp_sock_handler() has a surprising difference in flow between pasta and passt mode: For pasta we send each frame to the tap interface as we prepare it. For passt, though, we prepare all the frames, then send them with a single sendmmsg(). Alter the pasta path to also prepare all the frames, then send them at the end. We already have a suitable data structure for the passt case. This will make it easier to abstract out the tap backend difference in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	test/perf/pasta_tcp: Add host to namespace cases for traffic via tap	Stefano Brivio	2023-01-05	1	-0/+57
\| \| \| \| \| \| \| \| \| \|	Similarly to UDP cases, these were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in connections being routed via the tap interface. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	tcp: Explicitly check option length field values in tcp_opt_get()	Stefano Brivio	2023-01-05	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Reported by Coverity (CWE-606, Untrusted loop bound), and actually harmless because we'll exit the option-scanning loop if the remaining length is not enough for a new option, instead of reading past the header. In any case, it looks like a good idea to explicitly check for reasonable values of option lengths. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	test/perf/pasta_udp: Add host to namespace cases for traffic via tap	Stefano Brivio	2023-01-05	1	-0/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	These were missing as it wasn't clear, when the other tests were introduced, if using the global address of a namespace, from the host, should have resulted in traffic being routed via the tap interface (as opposed to the loopback interface). We now clarified that's actually the case. Use same values and thresholds as the tests for loopback traffic, as throughput figures currently indicate there isn't much difference. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
*	udp: Factor out control structure management from udp_sock_fill_data_v[46]	David Gibson	2022-12-06	1	-68/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP and other headers we'll need to forward data onto the tap interface. In addition they update the control structures (iovec and mmsghdr) we'll need to send the messages, and in the case of pasta actually sends it. This leads the control structure management and the send itself awkwardly split between udp_sock_fill_data_v[46]() and their caller udp_sock_handler(). In addition, this tail part of udp_sock_fill_datav[46] is essentially common between the IPv4 and IPv6 versions, apart from which control array we're working on. Clean this up by reducing these functions to just construct the headers and renaming them to udp_update_hdr[46]() accordingly. The control structure updates are now all in the caller, and common for IPv4 and IPv6. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>