aboutgitcodebugslistschat
path: root/passt.c
Commit message (Collapse)AuthorAgeFilesLines
* passt, pasta: Namespace-based sandboxing, defer seccomp policy applicationStefano Brivio2022-02-211-47/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, tap: Daemonise once socket is ready without waiting for connectionStefano Brivio2022-01-281-2/+4
| | | | | | | | | | | | | The existing behaviour is not really practical: an automated agent in charge of starting both qemu and passt would need to fork itself to start passt, because passt won't fork to background until qemu connects, and the agent needs to unblock to start qemu. Instead of waiting for a connection to daemonise, do it right away as soon as a socket is available: that can be considered an initialised state already. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* seccomp: Add a number of alternate and per-arch syscallsStefano Brivio2022-01-261-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Depending on the C library, but not necessarily in all the functions we use, statx() might be used instead of stat(), getdents() instead of getdents64(), readlinkat() instead of readlink(), openat() instead of open(). On aarch64, it's clone() and not fork(), and dup3() instead of dup2() -- just allow the existing alternative instead of dealing with per-arch selections. Since glibc commit 9a7565403758 ("posix: Consolidate fork implementation"), we need to allow set_robust_list() for fork()/clone(), even in a single-threaded context. On some architectures, epoll_pwait() is provided instead of epoll_wait(), but never both. Same with newfstat() and fstat(), sigreturn() and rt_sigreturn(), getdents64() and getdents(), readlink() and readlinkat(), unlink() and unlinkat(), whereas pipe() might not be available, but pipe2() always is, exclusively or not. Seen on Fedora 34: newfstatat() is used on top of fstat(). syslog() is an actual system call on some glibc/arch combinations, instead of a connect()/send() implementation. On ppc64 and ppc64le, _llseek(), recv(), send() and getuid() are used. For ppc64 only: ugetrlimit() for the getrlimit() implementation, plus sigreturn() and fcntl64(). On s390x, additionally, we need to allow socketcall() (on top of socket()), and sigreturn() also for passt (not just for pasta). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* Makefile, seccomp: Fix build for i386, ppc64, ppc64leStefano Brivio2022-01-261-1/+1
| | | | | | | | | | | | | | | | | On some distributions, on ppc64, ulimit -s returns 'unlimited': add a reasonable default, and also make sure ulimit is invoked using the default shell, which should ensure ulimit is actually implemented. Also note that AUDIT_ARCH doesn't follow closely the naming reported by 'uname -m': convert for i386 and ppc as needed. While at it, move inclusion of seccomp.h after util.h, the former is less generic (cosmetic/clang-tidy only). Older kernel headers might lack a definition for AUDIT_ARCH_PPC64LE: define that explicitly if it's not available. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitionsStefano Brivio2022-01-261-1/+0
| | | | | | | This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* seccomp: Add newfstatat to list of allowed syscallsStefano Brivio2021-10-211-1/+1
| | | | | | | ...it looks like, on a recent Fedora installation, daemon() uses it. Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Fork into background also if not running from a terminalStefano Brivio2021-10-211-1/+1
| | | | | | | | | This is actually annoying: there's no way to make it fork into background when running from a script. However, it's always possible to keep it in foreground with -f. Make it simpler, and always fork into background if -f is not given. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add cppcheck target, test, and address resulting warningsStefano Brivio2021-10-211-3/+1
| | | | | | | ...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Fix build with gcc 7, use std=c99, enable some more Clang checkersStefano Brivio2021-10-211-18/+15
| | | | | | | | | | | | | | Unions and structs, you all have names now. Take the chance to enable bugprone-reserved-identifier, cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy. Provide a ffsl() weak declaration using gcc built-in. Start reordering includes, but that's not enough for the llvm-include-order checker yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Address gcc 11 warningsStefano Brivio2021-10-201-4/+9
| | | | | | | | | A mix of unchecked return values, a missing permission mask for open(2) with O_CREAT, and some false positives from -Wstringop-overflow and -Wmaybe-uninitialized. Reported-by: Martin Hauke <mardnh@gmx.de> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Include linux/seccomp.h and linux/audit.h instead of seccomp.hStefano Brivio2021-10-191-1/+2
| | | | | | | We don't use libseccomp. Reported-by: Martin Hauke <mardnh@gmx.de> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add clock_gettime to list of allowed syscallsStefano Brivio2021-10-161-0/+1
| | | | | | | ...depending on the system clock source, glibc might use it to fetch the wall time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Static builds: don't redefine __vsyslog(), skip getpwnam() and ↵Stefano Brivio2021-10-161-5/+10
| | | | | | initgroups() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Check if a PID file was actually requested before creating itStefano Brivio2021-10-151-1/+1
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Don't refuse to run if UID is 0 in non-init namespaceStefano Brivio2021-10-141-1/+14
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf: Add -P, --pid, to specify a file where own PID is written toStefano Brivio2021-10-141-1/+24
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Warn if we're running as root, abort if we can't change to nobody:nobodyStefano Brivio2021-10-141-0/+29
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Drop all capabilities that we might have, except for CAP_NET_BIND_SERVICEStefano Brivio2021-10-141-0/+18
| | | | | | | | While it's not recommended to give passt any capability, drop all the ones we might have got by mistake, except for the only sensible one, CAP_NET_BIND_SERVICE. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Completely avoid dynamic memory allocationStefano Brivio2021-10-141-8/+8
| | | | | | | | | Replace libc functions that might dynamically allocate memory with own implementations or wrappers. Drop brk(2) from list of allowed syscalls in seccomp profile. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Add seccomp supportStefano Brivio2021-10-141-0/+36
| | | | | | | | | | | | | | | | | | List of allowed syscalls comes from comments in the form: #syscalls <list> for syscalls needed both in passt and pasta mode, and: #syscalls:pasta <list> #syscalls:passt <list> for syscalls specifically needed in pasta or passt mode only. seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast. While at it, clean up a bit the Makefile using wildcards. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, tap: Split netlink and pasta functions, allow interface configurationStefano Brivio2021-10-141-181/+2
| | | | | | | | | | Move netlink routines to their own file, and use netlink to configure or fetch all the information we need, except for the TUNSETIFF ioctl. Move pasta-specific functions to their own file as well, add parameters and calls to configure the tap interface in the namespace. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: Add second waitid() in pasta_child_handler()Stefano Brivio2021-10-071-0/+1
| | | | | | | | We usually have up to one additional child exiting while we receive a SIGCHLD, instead of complicating this with tracking PIDs, just add a second waitid() call. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: Allow specifying paths and names of namespacesGiuseppe Scrivano2021-10-071-20/+39
| | | | | | | | | | | | | | | | | | Based on a patch from Giuseppe Scrivano, this adds the ability to: - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only - specify the base directory for netns mountpoints, with --nsrun-dir Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Shrink binary size by dropping static initialisersStefano Brivio2021-10-051-2/+6
| | | | | | ...from 11MiB to 155KiB for 'make avx2', 95KiB with -Os and stripped. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket tooStefano Brivio2021-10-051-0/+1
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add handler for optional deferred tasksStefano Brivio2021-10-051-20/+26
| | | | | | | | We'll need this for TCP ACK coalescing on tap/guest-side. For convenience, allow _handler() functions to be undefined, courtesy of __attribute__((weak)). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Actually initialise timers for protocol handlersStefano Brivio2021-09-271-2/+16
| | | | | | | The initial timestamp was not initialised, so timers for protocol handlers wouldn't run at all sometimes. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Align pkt_buf to PAGE_SIZE (start and size), try to fit in huge pagesStefano Brivio2021-09-271-2/+6
| | | | | | | | | If transparent huge pages are available, madvise() will do the trick. While at it, decrease EPOLL_EVENTS for the main loop from 10 to 8, for slightly better socket fairness. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: Clean up namespace processes on exit, reap zombies from clone()Stefano Brivio2021-09-151-9/+83
| | | | | | | | | | | | If pasta created the namespace, it's probably expected that processes started in the same namespace are terminated once pasta exits. Scan procfs namespace links for corresponding processes, send SIGQUIT and SIGKILL (after one second) if found. While at it, make the signal handler reap otherwise-zombies resulting from clone(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: Set ping_group_range upon namespace creationStefano Brivio2021-09-091-0/+4
| | | | | | | ...this allows processes running as the only group available in the namespace to create ICMP Echo sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add epoll event indication and passt/pasta mode in socket debug messageStefano Brivio2021-09-091-1/+3
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: If a new namespace is created, wait for it to be ready before proceedingStefano Brivio2021-09-011-1/+15
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Introduce command-line options and port re-mappingStefano Brivio2021-09-011-423/+101
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp: Introduce scatter-gather IO path from socket to tapStefano Brivio2021-07-261-0/+1
| | | | | | | | | | | | | | | | | ...similarly to what was done for UDP. Quick performance test with 32KiB buffers, host to VM: $ iperf3 -c 192.0.2.2 -N [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 8.47 GBytes 7.27 Gbits/sec 0 sender [ 5] 0.00-10.00 sec 8.45 GBytes 7.26 Gbits/sec receiver $ iperf3 -c 2a01:598:88ba:a056:271f:473a:c0d9:abc1 [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 8.43 GBytes 7.24 Gbits/sec 0 sender [ 5] 0.00-10.00 sec 8.41 GBytes 7.22 Gbits/sec receiver Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, udp: Allow binding ports in init namespace to both tap and loopbackStefano Brivio2021-07-261-0/+15
| | | | | | | | Traffic with loopback source address will be forwarded to the direct loopback connection in the namespace, and the tap interface is used for the rest. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, udp: Split IPv4 and IPv6 bound port setsStefano Brivio2021-07-211-15/+31
| | | | | | | | | | | Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately. Port numbers of TCP ports that are bound in a namespace are also bound for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound if the corresponding IPv6 port is bound (socket might not have the IPV6_V6ONLY option set). This will also be configurable later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socketStefano Brivio2021-07-211-0/+14
| | | | | | | | | | Packets are received directly onto pre-cooked, static buffers for IPv4 (with partial checksum pre-calculation) and IPv6 frames, with pre-filled Ethernet addresses and, partially, IP headers, and sent out from the same buffers with sendmmsg(), for both passt and pasta (non-local traffic only) modes. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add PASTA mode, major reworkStefano Brivio2021-07-171-439/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host connectivity to an otherwise disconnected, unprivileged network and user namespace, similarly to slirp4netns. Given that the implementation is largely overlapping with PASST, no separate binary is built: 'pasta' (and 'passt4netns' for clarity) both link to 'passt', and the mode of operation is selected depending on how the binary is invoked. Usage example: $ unshare -rUn # echo $$ 1871759 $ ./pasta 1871759 # From another terminal # udhcpc -i pasta0 2>/dev/null # ping -c1 pasta.pizza PING pasta.pizza (64.190.62.111) 56(84) bytes of data. 64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms --- pasta.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms # ping -c1 spaghetti.pizza PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes 64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms --- spaghetti.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms This entails a major rework, especially with regard to the storage of tracked connections and to the semantics of epoll(7) references. Indexing TCP and UDP bindings merely by socket proved to be inflexible and unsuitable to handle different connection flows: pasta also provides Layer-2 to Layer-2 socket mapping between init and a separate namespace for local connections, using a pair of splice() system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local bindings. For instance, building on the previous example: # ip link set dev lo up # iperf3 -s $ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4 [SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender [SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver iperf Done. epoll(7) references now include a generic part in order to demultiplex data to the relevant protocol handler, using 24 bits for the socket number, and an opaque portion reserved for usage by the single protocol handlers, in order to track sockets back to corresponding connections and bindings. A number of fixes pertaining to TCP state machine and congestion window handling are also included here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: When probing for an existing instance, also accept ENOENT on connect()Stefano Brivio2021-05-231-1/+1
| | | | | | | | The most common case is actually that no other instance created a socket with that name -- and that also means there is no other instance. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: On -DDEBUG, log to stderr with timestampsStefano Brivio2021-05-211-1/+1
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Also log to stderr, don't fork to background if not interactiveStefano Brivio2021-05-211-2/+2
| | | | Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add support for multiple instances in different network namespacesStefano Brivio2021-05-211-13/+33
| | | | | | | | | | | | | ...sharing the same filesystem. Instead of a fixed path for the UNIX domain socket, passt now uses a path with a counter, probing for existing instances, and picking the first free one. The demo script is updated accordingly -- it can now be started several times to create multiple namespaces with an instance of passt each, with addressing reflecting separate subnets, and NDP proxying between them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp: Actually enforce MAX_CONNS limitStefano Brivio2021-05-211-1/+1
| | | | | | | | | | | | | | and, given that the connection table is indexed by socket number, we also need to increase MAX_CONNS now as the ICMP implementation needs 2^17 sockets, that will be opened before TCP connections are accepted. This needs to be changed later: the connection table should be indexed by a translated number -- we're wasting 2^17 table entries otherwise. Move initialisation of TCP listening sockets as last per-protocol initialisation, this will make it easier. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Close UNIX domain socket on failure before accepting new connectionsStefano Brivio2021-05-211-1/+3
| | | | | | | | The socket isn't necessarily closed, make sure we close it before getting a new one from accept(), so that we don't mix it up with protocol sockets numbering. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Introduce packet capture implementationStefano Brivio2021-05-211-0/+5
| | | | | | | | With -DDEBUG, passt now saves guest-side traffic captures in pcap format at /tmp/passt_<ISO8601 timestamp>.pcap. The timestamp refers to time and date of start-up. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* dhcp, ndp, dhcpv6: Support for multiple DNS servers, search listStefano Brivio2021-05-211-20/+57
| | | | | | | | | | | Add support for a variable amount of DNS servers, including zero, from /etc/resolv.conf, in DHCP, NDP and DHCPv6 implementations. Introduce support for domain search list for DHCP (RFC 3397), NDP (RFC 8106), and DHCPv6 (RFC 3646), also sourced from /etc/resolv.conf. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: With -DDEBUG, also print protocol number for unsupported protocolsStefano Brivio2021-05-211-5/+5
| | | | | | | ...otherwise, we have no idea what's going on if we receive something unexpected. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Don't fork into background until the UNIX domain socket isn't listeningStefano Brivio2021-05-101-4/+7
| | | | | | | | Once passt forks to background, it should be guaranteed that the UNIX domain socket is available, otherwise, if qemu is started right after it, it might fail to connect. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Keep just two arrays to print context IPv4 and IPv6 addressesStefano Brivio2021-05-101-9/+9
| | | | | | | | Multiple arrays, one for each address, were needed with a single fprintf(). Now that it's replaced by info(), we can have just one for each protocol version. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Don't use getprotobynumber() in debug buildStefano Brivio2021-05-101-3/+17
| | | | | | | | | With glibc, we can't reliably build a static binary with getprotobynumber(), which is currently used with -DDEBUG. Replace that with a small array of protocol strings. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>