passt - Plug A Simple Socket Transport

	Commit message (Collapse)	Author	Age	Files	Lines
*	treewide: Mark constant references as const	Stefano Brivio	2022-03-29	1	-16/+18
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	treewide: Packet abstraction with mandatory boundary checks	Stefano Brivio	2022-03-29	1	-20/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Enforce 24-bit limit on socket numbers	Stefano Brivio	2022-03-29	1	-0/+7
\| \| \| \| \| \| \|	This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use flags for local, loopback, and configured unicast binds	Stefano Brivio	2022-03-28	1	-25/+23
\| \| \| \| \| \| \| \| \| \| \|	There's no value in keeping a separate timestamp for activity and for aging of local binds, given that they have the same timeout. Reduce that to a single timestamp, with a flag indicating the local bind. Also use flags instead of separate int fields for loopback and configured unicast address usage as source address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Split buffer queueing/writing parts of udp_sock_handler()	Stefano Brivio	2022-03-28	1	-171/+193
\| \| \| \| \| \| \| \| \| \|	...it became too hard to follow: split it off to udp_sock_fill_data_v{4,6}. While at it, use IN6_ARE_ADDR_EQUAL(a, b), courtesy of netinet/in.h, instead of open-coded memcmp(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Drop _splice from recv, send, sendto static buffer names	Stefano Brivio	2022-03-28	1	-29/+23
\| \| \| \| \| \| \|	It's already implied by the fact they don't have "l2" in their names, and dropping it improves readability a bit. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Explicitly initialise sin6_scope_id and sin_zero in sockaddr_in{,6}	Stefano Brivio	2022-02-25	1	-0/+2
\| \| \| \| \| \| \|	Not functionally needed, but gcc versions 7 to 9 (at least) will issue a warning otherwise. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Receive batching doesn't pay off when writing single frames to tap	Stefano Brivio	2022-02-21	1	-16/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In pasta mode, when we get data from sockets and write it as single frames to the tap device, we batch receive operations considerably, and then (conceptually) split the data in many smaller writes. It looked like an obvious choice, but performance is actually better if we receive data in many small frame-sized recvmsg()/recvmmsg(). The syscall overhead with the previous behaviour, observed by perf, comes predominantly from write operations, but receiving data in shorter chunks probably improves cache locality by a considerable amount. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Allow loopback connections from host using configured unicast address	Stefano Brivio	2022-02-21	1	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \|	Likely for testing purposes only: allow connections from host to guest or namespace using, as connection target, the configured, possibly global unicast address. In this case, we have to map the destination address to a link-local address, and for port-based tracked responses, the source address needs to be again the unicast address: not loopback, not link-local. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, udp: Introduce basic DNS forwarding	Stefano Brivio	2022-02-21	1	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For compatibility with libslirp/slirp4netns users: introduce a mechanism to map, in the UDP routines, an address facing guest or namespace to the first IPv4 or IPv6 address resulting from configuration as resolver. This can be enabled with the new --dns-forward option. This implies that sourcing and using DNS addresses and search lists, passed via command line or read from /etc/resolv.conf, is not bound anymore to DHCP/DHCPv6/NDP usage: for example, pasta users might just want to use addresses from /etc/resolv.conf as mapping target, while not passing DNS options via DHCP. Reflect this in all the involved code paths by differentiating DHCP/DHCPv6/NDP usage from DNS configuration per se, and in the new options --dhcp-dns, --dhcp-search for pasta, and --no-dhcp-dns, --no-dhcp-search for passt. This should be the last bit to enable substantial compatibility between slirp4netns.sh and slirp4netns(1): pass the --dns-forward option from the script too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, pasta: Namespace-based sandboxing, defer seccomp policy application	Stefano Brivio	2022-02-21	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp, util: Fixes for bitmap handling on big-endian, casts	Stefano Brivio	2022-01-26	1	-2/+2
\| \| \| \| \| \| \| \|	Bitmap manipulating functions would otherwise refer to inconsistent sets of bits on big-endian architectures. While at it, fix up a couple of casts. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions	Stefano Brivio	2022-01-26	1	-2/+0
\| \| \| \| \| \| \|	This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Add cppcheck target, test, and address resulting warnings	Stefano Brivio	2021-10-21	1	-10/+8
\| \| \| \| \| \| \|	...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Avoid static initialiser for udp{4,6}_l2_buf	Stefano Brivio	2021-10-21	1	-18/+23
\| \| \| \| \| \|	With the new UDP_TAP_FRAMES value, the binary size grows considerably. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix maximum payload size calculation for IPv4 buffers, bump UDP_TAP_FRAMES	Stefano Brivio	2021-10-21	1	-2/+3
\| \| \| \| \| \| \| \|	The issue with a higher UDP_TAP_FRAMES was actually coming from a payload size the guest couldn't digest. Fix that, and bump UDP_TAP_FRAMES back to 128. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers	Stefano Brivio	2021-10-21	1	-43/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unions and structs, you all have names now. Take the chance to enable bugprone-reserved-identifier, cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy. Provide a ffsl() weak declaration using gcc built-in. Start reordering includes, but that's not enough for the llvm-include-order checker yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	ndp, dhcpv6, tcp, udp: Always use link-local as source if gateway isn't	Stefano Brivio	2021-10-20	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	This shouldn't happen on any sane configuration, but I just met an example of that: the default IPv6 gateway on the host is configured with a global unicast address, we use that as source for RA, DHCPv6 replies, and the guest ignores it. Same later on if we talk TCP or UDP and the guest has no idea where that address comes from. Use our link-local address in case the gateway address is global. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Add clang-tidy Makefile target and test, take care of warnings	Stefano Brivio	2021-10-20	1	-9/+9
\| \| \| \| \| \| \|	Most are just about style and form, but a few were actually serious mistakes (NDP-related). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Address warnings from Clang's scan-build	Stefano Brivio	2021-10-20	1	-4/+11
\| \| \| \| \| \|	All false positives so far. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Address gcc 11 warnings	Stefano Brivio	2021-10-20	1	-3/+5
\| \| \| \| \| \| \| \| \|	A mix of unchecked return values, a missing permission mask for open(2) with O_CREAT, and some false positives from -Wstringop-overflow and -Wmaybe-uninitialized. Reported-by: Martin Hauke <mardnh@gmx.de> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: drop bogus udp_tap_map ts assignment	Stefan Hajnoczi	2021-10-15	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The 'ts' field is a timestamp so assigning the socket file descriptor is incorrect. There is no actual bug because the current time is assigned just a few lines later: udp_tap_map[V4][src].sock = s; udp_tap_map[V4][src].ts = s; ^^^^^^^^^^^ bogus ^^^^^^^^^^ bitmap_set(udp_act[V4][UDP_ACT_TAP], src); } udp_tap_map[V4][src].ts = now->tv_sec; ^^^^^^^^^^^^^^^ correct ^^^^^^^^^^^^^^ Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	conf, tcp, udp: Add --no-map-gw to disable mapping gateway address to host	Stefano Brivio	2021-10-14	1	-2/+2
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, pasta: Add seccomp support	Stefano Brivio	2021-10-14	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	List of allowed syscalls comes from comments in the form: #syscalls <list> for syscalls needed both in passt and pasta mode, and: #syscalls:pasta <list> #syscalls:passt <list> for syscalls specifically needed in pasta or passt mode only. seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast. While at it, clean up a bit the Makefile using wildcards. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	pasta: Allow specifying paths and names of namespaces	Giuseppe Scrivano	2021-10-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on a patch from Giuseppe Scrivano, this adds the ability to: - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only - specify the base directory for netns mountpoints, with --nsrun-dir Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tap: Completely de-serialise input message batches	Stefano Brivio	2021-09-27	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	Until now, messages would be passed to protocol handlers in a single batch only if they happened to be dequeued in a row. Packets interleaved between different connections would result in multiple calls to the same protocol handler for a single connection. Instead, keep track of incoming packet descriptors, arrange them in sequences, and call protocol handlers only as we completely sorted input messages in batches. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Switch to new socket message after 32KiB instead of 64KiB	Stefano Brivio	2021-09-27	1	-2/+2
\| \| \| \| \| \| \|	For some reason, this measurably improves performance with qemu and virtio-net. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Decrease UDP_TAP_FRAMES to 16	Stefano Brivio	2021-09-27	1	-1/+1
\| \| \| \| \| \| \|	Similarly to the decrease in TCP_TAP_FRAMES, this improves fairness, with a very small impact on performance. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Reset iov_base after sending partial message on sendmmsg() failure	Stefano Brivio	2021-09-09	1	-0/+2
\| \| \| \| \| \| \|	We set the length while processing messges, but the starting address is pre-initialised. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix comparison of seen IPv4 address for local connections	Stefano Brivio	2021-09-09	1	-1/+2
\| \| \| \| \| \|	c->addr4_seen is stored in network order. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix retry mechanism on partial sendmmsg()	Stefano Brivio	2021-09-09	1	-3/+3
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Restore usage of gateway for guest to connect to local host	Stefano Brivio	2021-09-01	1	-6/+6
\| \| \| \| \| \| \|	This went lost in a recent rework: if the guest wants to connect directly to the host, it can use the address of the default gateway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Handle partial failure in sendmmsg() to UNIX domain socket	Stefano Brivio	2021-09-01	1	-20/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similarly to the handling introduced by commit "tcp: Proper error handling for sendmmsg() to UNIX domain socket" for TCP, we need to deal with partial sendmmsg() failures for UDP as well. Here, we can lose messages, but we need to make sure that the last message is delivered completely, otherwise qemu will fail to reassemble further packets. For UDP, this is somewhat complicated by the fact that one message might include multiple datagrams, and we need to respect message boundaries: go through headers, and calculate what we need to re-send, if anything. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt, pasta: Introduce command-line options and port re-mapping	Stefano Brivio	2021-09-01	1	-79/+95
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Map source address to gateway for any traffic from 127.0.0.0/8	Stefano Brivio	2021-07-26	1	-3/+4
\| \| \| \| \| \|	...instead of just 127.0.0.1. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Allow binding ports in init namespace to both tap and loopback	Stefano Brivio	2021-07-26	1	-34/+71
\| \| \| \| \| \| \| \|	Traffic with loopback source address will be forwarded to the direct loopback connection in the namespace, and the tap interface is used for the rest. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	checksum: Introduce AVX2 implementation, unify helpers	Stefano Brivio	2021-07-26	1	-2/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide an AVX2-based function using compiler intrinsics for TCP/IP-style checksums. The load/unpack/add idea and implementation is largely based on code from BESS (the Berkeley Extensible Software Switch) licensed as 3-Clause BSD, with a number of modifications to further decrease pipeline stalls and to minimise cache pollution. This speeds up considerably data paths from sockets to tap interfaces, decreasing overhead for checksum computation, with 16-64KiB packet buffers, from approximately 11% to 7%. The rest is just syscalls at this point. While at it, provide convenience targets in the Makefile for avx2, avx2_debug, and debug targets -- these simply add target-specific CFLAGS to the build. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Split IPv4 and IPv6 bound port sets	Stefano Brivio	2021-07-21	1	-29/+39
\| \| \| \| \| \| \| \| \| \| \|	Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately. Port numbers of TCP ports that are bound in a namespace are also bound for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound if the corresponding IPv6 port is bound (socket might not have the IPV6_V6ONLY option set). This will also be configurable later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket	Stefano Brivio	2021-07-21	1	-40/+333
\| \| \| \| \| \| \| \| \| \|	Packets are received directly onto pre-cooked, static buffers for IPv4 (with partial checksum pre-calculation) and IPv6 frames, with pre-filled Ethernet addresses and, partially, IP headers, and sent out from the same buffers with sendmmsg(), for both passt and pasta (non-local traffic only) modes. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Add PASTA mode, major rework	Stefano Brivio	2021-07-17	1	-257/+557
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host connectivity to an otherwise disconnected, unprivileged network and user namespace, similarly to slirp4netns. Given that the implementation is largely overlapping with PASST, no separate binary is built: 'pasta' (and 'passt4netns' for clarity) both link to 'passt', and the mode of operation is selected depending on how the binary is invoked. Usage example: $ unshare -rUn # echo $$ 1871759 $ ./pasta 1871759 # From another terminal # udhcpc -i pasta0 2>/dev/null # ping -c1 pasta.pizza PING pasta.pizza (64.190.62.111) 56(84) bytes of data. 64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms --- pasta.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms # ping -c1 spaghetti.pizza PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes 64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms --- spaghetti.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms This entails a major rework, especially with regard to the storage of tracked connections and to the semantics of epoll(7) references. Indexing TCP and UDP bindings merely by socket proved to be inflexible and unsuitable to handle different connection flows: pasta also provides Layer-2 to Layer-2 socket mapping between init and a separate namespace for local connections, using a pair of splice() system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local bindings. For instance, building on the previous example: # ip link set dev lo up # iperf3 -s $ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 \| tail -n4 [SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender [SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver iperf Done. epoll(7) references now include a generic part in order to demultiplex data to the relevant protocol handler, using 24 bits for the socket number, and an opaque portion reserved for usage by the single protocol handlers, in order to track sockets back to corresponding connections and bindings. A number of fixes pertaining to TCP state machine and congestion window handling are also included here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp, passt: Introduce socket packet buffer, avoid getsockname() for UDP	Stefano Brivio	2021-04-30	1	-13/+63
\| \| \| \| \| \| \| \| \| \|	This is in preparation for scatter-gather IO on the UDP receive path: save a getsockname() syscall by setting a flag if we get the numbering of all bound sockets in a strict sequence (expected, in practice) and repurpose the tap buffer to be also a socket receive buffer, passing it down to protocol handlers. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Connection tracking for ephemeral, local ports, and related fixes	Stefano Brivio	2021-04-29	1	-19/+259
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As we support UDP forwarding for packets that are sent to local ports, we actually need some kind of connection tracking for UDP. While at it, this commit introduces a number of vaguely related fixes for issues observed while trying this out. In detail: - implement an explicit, albeit minimalistic, connection tracking for UDP, to allow usage of ephemeral ports by the guest and by the host at the same time, by binding them dynamically as needed, and to allow mapping address changes for packets with a loopback address as destination - set the guest MAC address whenever we receive a packet from tap instead of waiting for an ARP request, and set it to broadcast on start, otherwise DHCPv6 might not work if all DHCPv6 requests time out before the guest starts talking IPv4 - split context IPv6 address into address we assign, global or site address seen on tap, and link-local address seen on tap, and make sure we use the addresses we've seen as destination (link-local choice depends on source address). Similarly, for IPv4, split into address we assign and address we observe, and use the address we observe as destination - introduce a clock_gettime() syscall right after epoll_wait() wakes up, so that we can remove all the other ones and pass the current timestamp to tap and socket handlers -- this is additionally needed by UDP to time out bindings to ephemeral ports and mappings between loopback address and a local address - rename sock_l4_add() to sock_l4(), no semantic changes intended - include <arpa/inet.h> in passt.c before kernel headers so that we can use <netinet/in.h> macros to check IPv6 address types, and remove a duplicate <linux/ip.h> inclusion Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Disable SO_ZEROCOPY again	Stefano Brivio	2021-04-25	1	-8/+2
\| \| \| \| \| \| \| \|	...on a second thought, this won't really help with veth, and actually causes a significant overhead as we get EPOLLERR whenever another process is tapping on the traffic. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Spare some syscalls, add some optimisations from profiling	Stefano Brivio	2021-04-23	1	-3/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Avoid a bunch of syscalls on forwarding paths by: - storing minimum and maximum file descriptor numbers for each protocol, fall back to SO_PROTOCOL query only on overlaps - allocating a larger receive buffer -- this can result in more coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024), so make sure we don't exceed that within a single call to protocol tap handlers - nesting the handling loop in tap_handler() in the receive loop, so that we have better chances of filling our receive buffer in fewer calls - skipping the recvfrom() in the UDP handler on EPOLLERR -- there's nothing to be done in that case and while at it: - restore the 20ms timer interval for periodic (TCP) events, I accidentally changed that to 100ms in an earlier commit - attempt using SO_ZEROCOPY for UDP -- if it's not available, sendmmsg() will succeed anyway - fix the handling of the status code from sendmmsg(), if it fails, we'll try to discard the first message, hence return 1 from the UDP handler Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	tcp, udp: Replace loopback source address by gateway address	Stefano Brivio	2021-04-22	1	-0/+7
\| \| \| \| \| \| \| \|	This is symmetric with tap operation and addressing model, and allows again to reach the guest behind the tap interface by contacting the local address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Introduce packet batching mechanism	Stefano Brivio	2021-04-22	1	-19/+50
\| \| \| \| \| \| \| \| \| \|	Receive packets in batches from AF_UNIX, check if they can be sent with a single syscall, and batch them up with sendmmsg() in case. A bit rudimentary, currently only implemented for UDP, but it seems to work. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Fix typo in tcp_tap_handler() documentation	Stefano Brivio	2021-03-17	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	udp: Use size_t for return value of recvfrom()	Stefano Brivio	2021-03-17	1	-1/+1
\| \| \| \|	Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: Assorted fixes from "fresh eyes" review	Stefano Brivio	2021-02-21	1	-15/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A bunch of fixes not worth single commits at this stage, notably: - make buffer, length parameter ordering consistent in ARP, DHCP, NDP handlers - strict checking of buffer, message and option length in DHCP handler (a malicious client could have easily crashed it) - set up forwarding for IPv4 and IPv6, and masquerading with nft for IPv4, from demo script - get rid of separate slow and fast timers, we don't save any overhead that way - stricter checking of buffer lengths as passed to tap handlers - proper dequeuing from qemu socket back-end: I accidentally trashed messages that were bundled up together in a single tap read operation -- the length header tells us what's the size of the next frame, but there's no apparent limit to the number of messages we get with one single receive - rework some bits of the TCP state machine, now passive and active connection closes appear to be robust -- introduce a new FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag from socket - streamline TCP option parsing routine - track TCP state changes to stderr (this is temporary, proper debugging and syslogging support pending) - observe that multiplying a number by four might very well change its value, and this happens to be the case for the data offset from the TCP header as we check if it's the same as the total length to find out if it's a duplicated ACK segment - recent estimates suggest that the duration of a millisecond is closer to a million nanoseconds than a thousand of them, this trend is now reflected into the timespec_diff_ms() convenience routine Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
*	passt: New design and implementation with native Layer 4 sockets	Stefano Brivio	2021-02-16	1	-0/+174
	This is a reimplementation, partially building on the earlier draft, that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW, providing L4-L2 translation functionality without requiring any security capability. Conceptually, this follows the design presented at: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md The most significant novelty here comes from TCP and UDP translation layers. In particular, the TCP state and translation logic follows the intent of being minimalistic, without reimplementing a full TCP stack in either direction, and synchronising as much as possible the TCP dynamic and flows between guest and host kernel. Another important introduction concerns addressing, port translation and forwarding. The Layer 4 implementations now attempt to bind on all unbound ports, in order to forward connections in a transparent way. While at it: - the qemu 'tap' back-end can't be used as-is by qrap anymore, because of explicit checks now introduced in qemu to ensure that the corresponding file descriptor is actually a tap device. For this reason, qrap now operates on a 'socket' back-end type, accounting for and building the additional header reporting frame length - provide a demo script that sets up namespaces, addresses and routes, and starts the daemon. A virtual machine started in the network namespace, wrapped by qrap, will now directly interface with passt and communicate using Layer 4 sockets provided by the host kernel. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>