path: root/util.c
Commit message | Author | Age | Files | Lines
* Handle userns isolation and dropping root at the same time | David Gibson | 2022-09-13 | 1 | -5/+0
  passt/pasta can interact with user namespaces in a number of ways:
  1) With --netns-only we'll remain in our original user namespace
  2) With --userns or a PID option to pasta we'll join either the given user namespace or that of the PID
  3) When pasta spawns a shell or command we'll start a new user namespace for the command and then join it
  4) With passt we'll create a new user namespace when we sandbox() ourself
  However (3) and (4) turn out to have essentially the same effect. In both cases we create one new user namespace. The spawned command starts there, and passt/pasta itself will live there from sandbox() onwards.
  Because of this, we can simplify user namespace handling by moving the userns handling earlier, to the same point we drop root in the original namespace. Extend the drop_user() function to isolate_user() which does both.
  After switching UID and GID in the original userns, isolate_user() will either join or create the userns we require. When we spawn a command with pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns, because we've already made one. sandbox() likewise no longer needs to create (or join) a userns because we're already in the one we need.
  We no longer need c->pasta_userns_fd, since the fd is only used locally in isolate_user(). Likewise we can replace c->netns_only with a local in conf(), since it's not used outside there.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Move self-isolation code into a separate file | David Gibson | 2022-09-13 | 1 | -51/+0
  passt/pasta contains a number of routines designed to isolate passt from the rest of the system for security. These are spread through util.c and passt.c. Move them together into a new isolation.c file.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Consolidate determination of UID/GID to run as | David Gibson | 2022-09-13 | 1 | -50/+0
  Currently the logic to work out what UID and GID we will run as is spread across conf(). If --runas is specified it's handled in conf_runas(), otherwise it's handled by check_root(), which depends on initialization of the uid and gid variables by either conf() itself or conf_runas().
  Make this clearer by putting all the UID and GID logic into a single conf_ugid() function.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Split checking for root from dropping root privilege | David Gibson | 2022-09-13 | 1 | -3/+26
  check_root() both checks to see if we are root (in the init namespace), and if we are drops to an unprivileged user. To make future cleanups simpler, split the checking for root (now in check_root()) from the actual dropping of privilege (now in drop_root()).
  Note that this does slightly alter semantics. Previously we would only setuid() if we were originally root (in the init namespace). Now we will always setuid() and setgid(), though it won't actually change anything if we weren't privileged to begin with. This also means that we will now always attempt to switch to the user specified with --runas, even if we aren't (init namespace) root to begin with. Obviously this will fail with an error if we weren't privileged to start with. --help and the man page are updated accordingly.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Don't store UID & GID persistently in the context structure | David Gibson | 2022-09-13 | 1 | -6/+6
  c->uid and c->gid are first set in conf(), and last used in check_root() itself called from conf(). Therefore these don't need to be fields in the long lived context structure and can instead be locals in conf().
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* util: Drop any supplementary group before dropping privileges | Stefano Brivio | 2022-08-30 | 1 | -1/+1
  Commit a951e0b9efcb ("conf: Add --runas option, changing to given UID and GID if started as root") dropped the call to initgroups() that used to add supplementary groups corresponding to the user we'll eventually run as -- we don't need those.
  However, if the original user belongs to supplementary groups (usually not the case, if started as root), we don't drop those, now, and rpmlint says:
    passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt
    passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt.avx2
  Add a call to setgroups() with an empty set, to drop any supplementary group we might currently have, before changing GID and UID.
  Reported-by: Daniel P. Berrangé <berrange@redhat.com>
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
  Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
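The ordering described in this entry (clear supplementary groups, then change GID, then UID) can be sketched as follows. This is a minimal illustration, not the code from util.c: the helper name `drop_to()` and its return convention are assumptions made for the example. The order matters because, once the UID has been changed, the process normally no longer has the privilege needed for setgroups() or setgid().

```c
#define _GNU_SOURCE		/* setgroups() is not plain POSIX */
#include <errno.h>
#include <grp.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* drop_to() - Hypothetical helper: drop supplementary groups, then GID, UID
 * Return: 0 on success, -1 on failure with errno set by the failing call
 */
int drop_to(uid_t uid, gid_t gid)
{
	/* Empty set: drop every supplementary group we might still have */
	if (setgroups(0, NULL))
		return -1;

	/* Group first: after setuid() we could no longer change it */
	if (setgid(gid))
		return -1;

	if (setuid(uid))
		return -1;

	return 0;
}
```

Called from an unprivileged process, this is expected to fail at setgroups() with EPERM, which is consistent with the later log entry noting that switching users fails if we weren't privileged to start with.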
* Make substructures for IPv4 and IPv6 specific context information | David Gibson | 2022-07-30 | 1 | -2/+2
  The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure.
  This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Allow different external interfaces for IPv4 and IPv6 connectivity | David Gibson | 2022-07-30 | 1 | -1/+1
  It's quite plausible for a host to have both IPv4 and IPv6 connectivity, but only via different interfaces. For example, this will happen in the case that IPv6 connectivity is via a tunnel (e.g. 6in4 or 6rd). It would also happen in the case that IPv4 access is via a tunnel on an otherwise IPv6 only local network, which is a setup that might become more common in the post IPv4 address exhaustion world.
  It turns out there's no real need for passt/pasta to get its IPv4 and IPv6 connectivity via the same interface, so we can handle this situation fairly easily. Change the core to allow separate external interfaces for IPv4 and IPv6. We don't actually set these separately for now.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* util: Fix debug print on failed SO_REUSEADDR setting in sock_l4() | Stefano Brivio | 2022-07-14 | 1 | -1/+1
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* Remove unused line_read() | David Gibson | 2022-07-06 | 1 | -54/+0
  The old, ugly implementation of line_read() is no longer used. Remove it.
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* Use new lineread implementation for procfs_scan_listen() | David Gibson | 2022-07-06 | 1 | -4/+6
  Use the new more solid implementation of line by line reading for procfs_scan_listen().
  Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* conf: Add --runas option, changing to given UID and GID if started as root | Stefano Brivio | 2022-05-19 | 1 | -0/+52
  On some systems, user and group "nobody" might not be available. The new --runas option allows overriding the default "nobody" choice if started as root.
  Now that we allow this, drop the initgroups() call that was used to add any additional groups for the given user, as that might now grant unnecessarily broad permissions. For instance, several distributions have a "kvm" group to allow regular user access to /dev/kvm, and we don't need that in passt or pasta.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, tcp, udp: Allow address specification for forwarded ports | Stefano Brivio | 2022-05-01 | 1 | -15/+12
  This feature is available in slirp4netns but was missing in passt and pasta.
  Given that we don't do dynamic memory allocation, we need to bind sockets while parsing port configuration. This means we need to process all other options first, as they might affect addressing and IP version support. It also implies a minor rework of how TCP and UDP implementations bind sockets.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* treewide: Unchecked return value from library, CWE-252 | Stefano Brivio | 2022-04-07 | 1 | -4/+7
  All instances were harmless, but it might be useful to have some debug messages here and there. Reported by Coverity.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks | Stefano Brivio | 2022-03-29 | 1 | -1/+1
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* treewide: Mark constant references as const | Stefano Brivio | 2022-03-29 | 1 | -4/+4
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* treewide: Packet abstraction with mandatory boundary checks | Stefano Brivio | 2022-03-29 | 1 | -23/+37
  Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length.
  Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices.
  This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers.
  Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
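The idea of a pool of descriptors over an external buffer, with checks on both insertion and retrieval, might look roughly like the sketch below. All names here (`struct pkt_pool`, `pool_add()`, `pool_get()`, `POOL_SLOTS`) are hypothetical and chosen for the example; the actual interface introduced by this commit may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS	8	/* arbitrary pool size for this sketch */

struct pkt_desc {
	size_t offset;		/* start of packet, relative to buffer */
	size_t len;		/* packet length in bytes */
};

/* The pool has no storage of its own: it only describes ranges in buf */
struct pkt_pool {
	const char *buf;
	size_t buf_size;
	size_t count;
	struct pkt_desc pkt[POOL_SLOTS];
};

/* Queue a packet: fails (-1) if the pool is full or range leaves buffer */
int pool_add(struct pkt_pool *p, const char *start, size_t len)
{
	uintptr_t base = (uintptr_t)p->buf, s = (uintptr_t)start;

	if (p->count >= POOL_SLOTS)
		return -1;
	if (s < base || s - base > p->buf_size)
		return -1;
	if (len > p->buf_size - (s - base))
		return -1;

	p->pkt[p->count].offset = s - base;
	p->pkt[p->count].len = len;
	p->count++;

	return 0;
}

/* Fetch len bytes at offset of packet idx: NULL if any check fails */
const void *pool_get(const struct pkt_pool *p, size_t idx,
		     size_t offset, size_t len)
{
	const struct pkt_desc *d;

	if (idx >= p->count)
		return NULL;

	d = &p->pkt[idx];
	if (offset > d->len || len > d->len - offset)
		return NULL;

	return p->buf + d->offset + offset;
}
```

The length checks are written as subtractions (`len > size - offset`) rather than additions so that a huge length cannot wrap around and pass the comparison.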
* util: Fix function declaration style of write_pidfile() | Stefano Brivio | 2022-03-29 | 1 | -1/+2
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, udp, util: Enforce 24-bit limit on socket numbers | Stefano Brivio | 2022-03-29 | 1 | -0/+7
  This should never happen, but there are no formal guarantees: ensure socket numbers are below SOCKET_MAX.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
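A check of this kind can be very small. In the sketch below, `SOCKET_MAX` is the name used in the commit message, but its definition as `1 << 24` is inferred from the "24-bit limit" in the title, and `socket_number_fits()` is a hypothetical helper, not a function from the tree.

```c
#define SOCKET_REF_BITS	24			/* assumed from "24-bit limit" */
#define SOCKET_MAX	(1 << SOCKET_REF_BITS)

/* Hypothetical predicate: can this descriptor be stored in 24 bits? */
int socket_number_fits(int s)
{
	return s >= 0 && s < SOCKET_MAX;
}
```

A caller would typically close the socket and report an error when the predicate is false, instead of storing a truncated number.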
* tcp: Refactor to use events instead of states, split out spliced implementation | Stefano Brivio | 2022-03-28 | 1 | -0/+19
  Using events and flags instead of states makes the implementation much more straightforward: actions are mostly centered on events that occurred on the connection rather than states.
  An example is given by the ESTABLISHED_SOCK_FIN_SENT and FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about which side started closing the connection to handle closing of connection halves.
  Split out the spliced implementation, as it has very little in common with the "regular" TCP path.
  Refactor things here and there to improve clarity. Add helpers to trace where resets and flag settings come from.
  No functional changes intended.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, util, tap: Implement --trace option for extra verbose logging | Stefano Brivio | 2022-03-25 | 1 | -0/+6
  --debug can be a bit too noisy, especially as single packets or socket messages are logged: implement a new option, --trace, implying --debug, that enables all debug messages.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* seccomp: Adjust list of allowed syscalls for armv6l, armv7l | Stefano Brivio | 2022-02-26 | 1 | -1/+2
  It looks like glibc commonly implements clock_gettime(2) with clock_gettime64(), and uses recv() instead of recvfrom(), send() instead of sendto(), and sigreturn() instead of rt_sigreturn() on armv6l and armv7l.
  Adjust the list of system calls for armv6l and armv7l accordingly.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Namespace-based sandboxing, defer seccomp policy application | Stefano Brivio | 2022-02-21 | 1 | -16/+113
  To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation.
  While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces.
  With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64.
  While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense.
  We have to open all the file handles we'll ever need before sandboxing:
  - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead
  - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context
  - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one
  Clarify the (unchanged) behaviour for --netns-only in the man page.
  To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page.
  For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits.
  We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits.
  Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Avoid return of possibly truncated unsigned long in bitmap_isset() | Stefano Brivio | 2022-02-01 | 1 | -2/+2
  Oops. If *word & BITMAP_BIT(bit) is bigger than an int (which is the case for half of the possible bits of a bitmap on 64-bit archs), we'll return that as an int, that is, zero, even if the bit at hand is set. Just return zero or one there, no callers are interested in the actual bitmap as return value.
  Issue found as pasta wouldn't automatically detect some bound ports.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
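The truncation described here is easy to reproduce in isolation. The sketch below reuses the names `BITMAP_BIT` and `bitmap_isset()` from the commit message, but the macro definitions and function bodies are assumptions written for illustration, and `bitmap_isset_truncated()` is a made-up name for the old, buggy shape. The explicit `(int)` cast stands in for the implicit conversion that the return statement used to perform.

```c
#include <limits.h>

#define BITS_PER_WORD		(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_BIT(bit)		(1UL << ((bit) % BITS_PER_WORD))
#define BITMAP_WORD(bit)	((bit) / BITS_PER_WORD)

/* Buggy shape: the masked unsigned long is narrowed to int. With a 64-bit
 * long, bits 32..63 of a word typically narrow to zero even when set.
 */
int bitmap_isset_truncated(const unsigned long *map, unsigned bit)
{
	return (int)(map[BITMAP_WORD(bit)] & BITMAP_BIT(bit));
}

/* Fixed shape: normalise to zero or one before narrowing */
int bitmap_isset(const unsigned long *map, unsigned bit)
{
	return !!(map[BITMAP_WORD(bit)] & BITMAP_BIT(bit));
}
```

For a set bit in the upper half of a 64-bit word, the first function usually reports "not set" while the second returns 1, which matches the symptom in the commit message (some bound ports not being detected).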
* passt: Address new clang-tidy warnings from LLVM 13.0.1 | Stefano Brivio | 2022-01-30 | 1 | -3/+5
  clang-tidy from LLVM 13.0.1 reports some new warnings from these checkers:
  - altera-unroll-loops, altera-id-dependent-backward-branch: ignore for the moment being, add a TODO item
  - bugprone-easily-swappable-parameters: ignore, nothing to do about those
  - readability-function-cognitive-complexity: ignore for the moment being, add a TODO item
  - altera-struct-pack-align: ignore, alignment is forced in protocol headers
  - concurrency-mt-unsafe: ignore for the moment being, add a TODO item
  Fix bugprone-implicit-widening-of-multiplication-result warnings, though, that's doable and they seem to make sense.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, udp, util: Fixes for bitmap handling on big-endian, casts | Stefano Brivio | 2022-01-26 | 1 | -3/+9
  Bitmap manipulating functions would otherwise refer to inconsistent sets of bits on big-endian architectures. While at it, fix up a couple of casts.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, pasta: Explicitly pass CLONE_{NEWUSER,NEWNET} to setns() | Stefano Brivio | 2022-01-26 | 1 | -2/+2
  Only allow the intended types of namespaces to be joined via setns() as a defensive measure.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
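As a rough illustration of the defensive measure: setns() accepts 0 as its second argument, meaning "any namespace type", whereas passing a specific CLONE_NEW* flag makes the call fail if the descriptor refers to a different kind of namespace. The helper name `join_userns_netns()` and its convention of skipping a negative user namespace descriptor are assumptions made for this sketch.

```c
#define _GNU_SOURCE		/* setns() and CLONE_NEW* need this on glibc */
#include <sched.h>

/* Hypothetical helper: join a user namespace (optional, skipped if the
 * descriptor is negative), then a network namespace.
 * Return: 0 on success, -1 on failure with errno set by setns()
 */
int join_userns_netns(int userns_fd, int netns_fd)
{
	/* Explicit type: refuse descriptors for any other namespace kind */
	if (userns_fd >= 0 && setns(userns_fd, CLONE_NEWUSER))
		return -1;

	if (setns(netns_fd, CLONE_NEWNET))
		return -1;

	return 0;
}
```

The user namespace is joined first in this sketch because entering it is what usually grants the capabilities needed to join the network namespace owned by it.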
* passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions | Stefano Brivio | 2022-01-26 | 1 | -2/+0
  This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add cppcheck target, test, and address resulting warnings | Stefano Brivio | 2021-10-21 | 1 | -2/+2
  ...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers | Stefano Brivio | 2021-10-21 | 1 | -4/+6
  Unions and structs, you all have names now.
  Take the chance to enable bugprone-reserved-identifier, cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.
  Provide a ffsl() weak declaration using gcc built-in.
  Start reordering includes, but that's not enough for the llvm-include-order checker yet.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Go to next non-empty line, skip newlines in line_read() | Stefano Brivio | 2021-10-20 | 1 | -1/+5
  Otherwise, we'll stop returning lines at the first empty line in a file -- this is not expected in case of e.g. /etc/resolv.conf.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add clang-tidy Makefile target and test, take care of warnings | Stefano Brivio | 2021-10-20 | 1 | -2/+3
  Most are just about style and form, but a few were actually serious mistakes (NDP-related).
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Static builds: don't redefine __vsyslog(), skip getpwnam() and initgroups() | Stefano Brivio | 2021-10-16 | 1 | -4/+7
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util, pasta: Don't read() and lseek() every single line in read_line() | Stefano Brivio | 2021-10-16 | 1 | -4/+23
  ...periodically checking bound ports becomes quite expensive otherwise.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Don't duplicate debug messages, they're already on stderr | Stefano Brivio | 2021-10-15 | 1 | -4/+4
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Completely avoid dynamic memory allocation | Stefano Brivio | 2021-10-14 | 1 | -12/+117
  Replace libc functions that might dynamically allocate memory with own implementations or wrappers.
  Drop brk(2) from list of allowed syscalls in seccomp profile.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Add seccomp support | Stefano Brivio | 2021-10-14 | 1 | -0/+2
  List of allowed syscalls comes from comments in the form:
    #syscalls <list>
  for syscalls needed both in passt and pasta mode, and:
    #syscalls:pasta <list>
    #syscalls:passt <list>
  for syscalls specifically needed in pasta or passt mode only.
  seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast.
  While at it, clean up a bit the Makefile using wildcards.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Fix comment to bitmap_clear() | Stefano Brivio | 2021-10-14 | 1 | -1/+1
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, tap: Split netlink and pasta functions, allow interface configuration | Stefano Brivio | 2021-10-14 | 1 | -1/+1
  Move netlink routines to their own file, and use netlink to configure or fetch all the information we need, except for the TUNSETIFF ioctl.
  Move pasta-specific functions to their own file as well, add parameters and calls to configure the tap interface in the namespace.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* pasta: Allow specifying paths and names of namespaces | Giuseppe Scrivano | 2021-10-07 | 1 | -20/+8
  Based on a patch from Giuseppe Scrivano, this adds the ability to:
  - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns
  - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only
  - specify the base directory for netns mountpoints, with --nsrun-dir
  Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
  [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page]
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too | Stefano Brivio | 2021-10-05 | 1 | -0/+28
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* conf, tcp: Periodic detection of bound ports for pasta port forwarding | Stefano Brivio | 2021-09-27 | 1 | -2/+6
  Detecting bound ports at start-up time isn't terribly useful: do this periodically instead, if configured.
  This is only implemented for TCP at the moment, UDP is somewhat more complicated: leave a TODO there.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
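Other entries in this log mention /proc/net/tcp and procfs_scan_listen() as the source for bound-port detection. Assuming that is how ports are found, a single pass over one line of /proc/net/tcp could be sketched as below. The helper `listen_port_from_line()` is hypothetical, handles the IPv4 file only, and relies on the kernel printing the local port in hexadecimal after the address and the socket state as the fourth field, with 0A meaning "listening".

```c
#include <stdio.h>

#define TCP_STATE_LISTEN	0x0A	/* state printed for listening sockets */

/* Hypothetical helper: parse one line of /proc/net/tcp (IPv4 only)
 * Return: local port if the line describes a listening socket, -1 otherwise
 */
int listen_port_from_line(const char *line)
{
	unsigned port, state;

	/* Fields: "sl: local_addr:port remote_addr:port state ..." */
	if (sscanf(line, "%*u: %*x:%x %*x:%*x %x", &port, &state) != 2)
		return -1;

	if (state != TCP_STATE_LISTEN || port > 65535)
		return -1;

	return (int)port;
}
```

A periodic scan would rewind the already open file, feed each line to a function like this, and mark the returned ports in a bitmap. The header line of the file does not match the format and is rejected.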
* util: Fix parsing of next option in ipv6_l4hdr() | Stefano Brivio | 2021-09-27 | 1 | -2/+1
  We need to update next header and header length as soon as we meet a new option header.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt, pasta: Introduce command-line options and port re-mapping | Stefano Brivio | 2021-09-01 | 1 | -24/+23
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Don't close ping sockets if bind() fails | Stefano Brivio | 2021-08-04 | 1 | -3/+6
  ...they're still usable, thanks to the workaround implemented in icmp_tap_handler().
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* util: Fix millisecond logging timestamp calculation | Stefano Brivio | 2021-08-04 | 1 | -1/+1
  Four sub-second digits means 0.1ms units: divide nanoseconds by 10^5, not 10^6.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
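The arithmetic behind this fix: one second is 10^9 ns, so printing four digits after the decimal point means each unit is 10^9 / 10^4 = 10^5 ns. A sketch of such a formatter follows; `format_stamp()` and the exact format string are assumptions for the example, not the logging code from util.c.

```c
#define _POSIX_C_SOURCE 200809L	/* struct timespec with strict -std= modes */
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: print seconds with four sub-second digits
 * Return: number of characters snprintf() would have written
 */
int format_stamp(char *buf, size_t size, const struct timespec *ts)
{
	/* 0.1 ms units: nanoseconds / 10^5. Dividing by 10^6 would give
	 * milliseconds, which need only three digits, so every fraction
	 * would be printed ten times too small.
	 */
	return snprintf(buf, size, "%lli.%04li",
			(long long)ts->tv_sec, (long)(ts->tv_nsec / 100000L));
}
```

With `tv_sec = 1` and `tv_nsec = 234567890`, this yields `1.2345`; dividing by 10^6 instead would have produced `1.0234`.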
* tcp, udp: Allow binding ports in init namespace to both tap and loopback | Stefano Brivio | 2021-07-26 | 1 | -5/+9
  Traffic with loopback source address will be forwarded to the direct loopback connection in the namespace, and the tap interface is used for the rest.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* checksum: Introduce AVX2 implementation, unify helpers | Stefano Brivio | 2021-07-26 | 1 | -80/+0
  Provide an AVX2-based function using compiler intrinsics for TCP/IP-style checksums. The load/unpack/add idea and implementation is largely based on code from BESS (the Berkeley Extensible Software Switch) licensed as 3-Clause BSD, with a number of modifications to further decrease pipeline stalls and to minimise cache pollution.
  This speeds up considerably data paths from sockets to tap interfaces, decreasing overhead for checksum computation, with 16-64KiB packet buffers, from approximately 11% to 7%. The rest is just syscalls at this point.
  While at it, provide convenience targets in the Makefile for avx2, avx2_debug, and debug targets -- these simply add target-specific CFLAGS to the build.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
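For reference, the "TCP/IP-style checksum" that any vectorised routine has to reproduce is the 16-bit one's complement sum described in RFC 1071. The scalar sketch below is not the AVX2 code from this commit and is not taken from the tree; `csum_scalar()` is a hypothetical name. It reads the data as big-endian 16-bit words, so the result is the numeric checksum value, which a caller would still need to store in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: Internet checksum over len bytes
 * Return: one's complement of the folded 16-bit sum
 */
uint16_t csum_scalar(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint64_t sum = 0;

	/* Add up 16-bit big-endian words */
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}

	/* Odd trailing byte is padded with a zero byte on the right */
	if (len)
		sum += (uint32_t)p[0] << 8;

	/* Fold carries back into the low 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

The sample bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) sum to 0xddf2 after folding, so this function returns 0x220d for them. A vectorised version can be checked against a reference like this one on random buffers.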
* udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket | Stefano Brivio | 2021-07-21 | 1 | -14/+26
  Packets are received directly onto pre-cooked, static buffers for IPv4 (with partial checksum pre-calculation) and IPv6 frames, with pre-filled Ethernet addresses and, partially, IP headers, and sent out from the same buffers with sendmmsg(), for both passt and pasta (non-local traffic only) modes.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
* passt: Add PASTA mode, major rework | Stefano Brivio | 2021-07-17 | 1 | -31/+131
  PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host connectivity to an otherwise disconnected, unprivileged network and user namespace, similarly to slirp4netns. Given that the implementation is largely overlapping with PASST, no separate binary is built: 'pasta' (and 'passt4netns' for clarity) both link to 'passt', and the mode of operation is selected depending on how the binary is invoked. Usage example:
    $ unshare -rUn
    # echo $$
    1871759
    $ ./pasta 1871759	# From another terminal
    # udhcpc -i pasta0 2>/dev/null
    # ping -c1 pasta.pizza
    PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
    64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
    --- pasta.pizza ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
    # ping -c1 spaghetti.pizza
    PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
    64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
    --- spaghetti.pizza ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
  This entails a major rework, especially with regard to the storage of tracked connections and to the semantics of epoll(7) references.
  Indexing TCP and UDP bindings merely by socket proved to be inflexible and unsuitable to handle different connection flows: pasta also provides Layer-2 to Layer-2 socket mapping between init and a separate namespace for local connections, using a pair of splice() system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local bindings. For instance, building on the previous example:
    # ip link set dev lo up
    # iperf3 -s
    $ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
    [SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
    [SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
    iperf Done.
  epoll(7) references now include a generic part in order to demultiplex data to the relevant protocol handler, using 24 bits for the socket number, and an opaque portion reserved for usage by the single protocol handlers, in order to track sockets back to corresponding connections and bindings.
  A number of fixes pertaining to TCP state machine and congestion window handling are also included here.
  Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
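The epoll(7) reference described above fits in the 64-bit user data field that epoll carries with each registered descriptor. The sketch below shows one way to pack a protocol identifier, a 24-bit socket number and an opaque 32-bit value into such a field. The bit layout (8 bits of protocol, 24 bits of socket, 32 bits opaque) and the `ref_*()` names are assumptions made for illustration; only the 24-bit width of the socket number comes from the commit message.

```c
#include <stdint.h>

#define SOCKET_BITS	24
#define SOCKET_MAX	(1U << SOCKET_BITS)

/* Assumed layout: bits 0-7 protocol, 8-31 socket, 32-63 handler-private */

/* Return: 0 on success, -1 if the socket number doesn't fit in 24 bits */
int ref_pack(uint64_t *ref, uint8_t proto, int s, uint32_t opaque)
{
	if (s < 0 || (uint32_t)s >= SOCKET_MAX)
		return -1;

	*ref = (uint64_t)proto | ((uint64_t)s << 8) | ((uint64_t)opaque << 32);

	return 0;
}

/* Generic part: enough to pick the protocol handler and the socket */
uint8_t ref_proto(uint64_t ref)
{
	return (uint8_t)(ref & 0xff);
}

int ref_socket(uint64_t ref)
{
	return (int)((ref >> 8) & (SOCKET_MAX - 1));
}

/* Opaque part: interpreted only by the selected protocol handler */
uint32_t ref_opaque(uint64_t ref)
{
	return (uint32_t)(ref >> 32);
}
```

The packed value would be stored in the `u64` member of `epoll_data_t` when registering the socket, and unpacked in the event loop: the generic part selects the handler, and the handler alone interprets the opaque part (for example as a connection index or a bound port).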