| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Depending on the C library, but not necessarily in all the
functions we use, statx() might be used instead of stat(),
getdents() instead of getdents64(), readlinkat() instead of
readlink(), openat() instead of open().
On aarch64, it's clone() and not fork(), and dup3() instead of
dup2() -- just allow the existing alternative instead of dealing
with per-arch selections.
Since glibc commit 9a7565403758 ("posix: Consolidate fork
implementation"), we need to allow set_robust_list() for
fork()/clone(), even in a single-threaded context.
On some architectures, epoll_pwait() is provided instead of
epoll_wait(), but never both. Same with newfstat() and
fstat(), sigreturn() and rt_sigreturn(), getdents64() and
getdents(), readlink() and readlinkat(), unlink() and
unlinkat(), whereas pipe() might not be available, but
pipe2() always is, exclusively or not.
Seen on Fedora 34: newfstatat() is used on top of fstat().
syslog() is an actual system call on some glibc/arch combinations,
instead of a connect()/send() implementation.
On ppc64 and ppc64le, _llseek(), recv(), send() and getuid()
are used. For ppc64 only: ugetrlimit() for the getrlimit()
implementation, plus sigreturn() and fcntl64().
On s390x, additionally, we need to allow socketcall() (on top
of socket()), and sigreturn() also for passt (not just for
pasta).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some distributions, on ppc64, ulimit -s returns 'unlimited': add a
reasonable default, and also make sure ulimit is invoked using the
default shell, which should ensure ulimit is actually implemented.
Also note that AUDIT_ARCH doesn't follow closely the naming reported
by 'uname -m': convert for i386 and ppc as needed.
While at it, move inclusion of seccomp.h after util.h, the former is
less generic (cosmetic/clang-tidy only).
Older kernel headers might lack a definition for AUDIT_ARCH_PPC64LE:
define that explicitly if it's not available.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
This is the only remaining Linux-specific include -- drop it to avoid
clang-tidy warnings and to make code more portable.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
...it looks like, on a recent Fedora installation, daemon() uses it.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
This is actually annoying: there's no way to make it fork into
background when running from a script. However, it's always
possible to keep it in foreground with -f. Make it simpler, and
always fork into background if -f is not given.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
...mostly false positives, but a number of very relevant ones too,
in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unions and structs, you all have names now.
Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.
Provide a ffsl() weak declaration using gcc built-in.
Start reordering includes, but that's not enough for the
llvm-include-order checker yet.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
A mix of unchecked return values, a missing permission mask for
open(2) with O_CREAT, and some false positives from
-Wstringop-overflow and -Wmaybe-uninitialized.
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
We don't use libseccomp.
Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
...depending on the system clock source, glibc might use it to
fetch the wall time.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
| |
initgroups()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
While it's not recommended to give passt any capability, drop all the
ones we might have got by mistake, except for the only sensible one,
CAP_NET_BIND_SERVICE.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Replace libc functions that might dynamically allocate memory with own
implementations or wrappers.
Drop brk(2) from list of allowed syscalls in seccomp profile.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
List of allowed syscalls comes from comments in the form:
#syscalls <list>
for syscalls needed both in passt and pasta mode, and:
#syscalls:pasta <list>
#syscalls:passt <list>
for syscalls specifically needed in pasta or passt mode only.
seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.
While at it, clean up a bit the Makefile using wildcards.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.
Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
We usually have up to one additional child exiting while we receive
a SIGCHLD, instead of complicating this with tracking PIDs, just
add a second waitid() call.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Based on a patch from Giuseppe Scrivano, this adds the ability to:
- specify paths and names of target namespaces to join, instead of
a PID, also for user namespaces, with --userns
- request to join or create a network namespace only, without
entering or creating a user namespace, with --netns-only
- specify the base directory for netns mountpoints, with --nsrun-dir
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
they're not created, implemented --netns-only and --nsrun-dir,
updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
| |
...from 11MiB to 155KiB for 'make avx2', 95KiB with -Os and stripped.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
We'll need this for TCP ACK coalescing on tap/guest-side. For
convenience, allow _handler() functions to be undefined, courtesy
of __attribute__((weak)).
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
The initial timestamp was not initialised, so timers for protocol
handlers wouldn't run at all sometimes.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
If transparent huge pages are available, madvise() will do the trick.
While at it, decrease EPOLL_EVENTS for the main loop from 10 to 8,
for slightly better socket fairness.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
If pasta created the namespace, it's probably expected that processes
started in the same namespace are terminated once pasta exits. Scan
procfs namespace links for corresponding processes, send SIGQUIT and
SIGKILL (after one second) if found.
While at it, make the signal handler reap otherwise-zombies resulting
from clone().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
...this allows processes running as the only group available in the
namespace to create ICMP Echo sockets.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...similarly to what was done for UDP. Quick performance test with
32KiB buffers, host to VM:
$ iperf3 -c 192.0.2.2 -N
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 8.47 GBytes 7.27 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 8.45 GBytes 7.26 Gbits/sec receiver
$ iperf3 -c 2a01:598:88ba:a056:271f:473a:c0d9:abc1
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 8.43 GBytes 7.24 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 8.41 GBytes 7.22 Gbits/sec receiver
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
Traffic with loopback source address will be forwarded to the direct
loopback connection in the namespace, and the tap interface is used
for the rest.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately.
Port numbers of TCP ports that are bound in a namespace are also bound
for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound
if the corresponding IPv6 port is bound (socket might not have the
IPV6_V6ONLY option set). This will also be configurable later.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Packets are received directly onto pre-cooked, static buffers
for IPv4 (with partial checksum pre-calculation) and IPv6 frames,
with pre-filled Ethernet addresses and, partially, IP headers,
and sent out from the same buffers with sendmmsg(), for both
passt and pasta (non-local traffic only) modes.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
The most common case is actually that no other instance created a
socket with that name -- and that also means there is no other
instance.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...sharing the same filesystem. Instead of a fixed path for the UNIX
domain socket, passt now uses a path with a counter, probing for
existing instances, and picking the first free one.
The demo script is updated accordingly -- it can now be started several
times to create multiple namespaces with an instance of passt each,
with addressing reflecting separate subnets, and NDP proxying between
them.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
and, given that the connection table is indexed by socket number,
we also need to increase MAX_CONNS now as the ICMP implementation
needs 2^17 sockets, that will be opened before TCP connections are
accepted.
This needs to be changed later: the connection table should be
indexed by a translated number -- we're wasting 2^17 table entries
otherwise. Move initialisation of TCP listening sockets as last
per-protocol initialisation, this will make it easier.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
The socket isn't necessarily closed, make sure we close it before
getting a new one from accept(), so that we don't mix it up with
protocol sockets numbering.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
With -DDEBUG, passt now saves guest-side traffic captures in
pcap format at /tmp/passt_<ISO8601 timestamp>.pcap. The timestamp
refers to time and date of start-up.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for a variable amount of DNS servers, including zero,
from /etc/resolv.conf, in DHCP, NDP and DHCPv6 implementations.
Introduce support for domain search list for DHCP (RFC 3397),
NDP (RFC 8106), and DHCPv6 (RFC 3646), also sourced from
/etc/resolv.conf.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
...otherwise, we have no idea what's going on if we receive something
unexpected.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
Once passt forks to background, it should be guaranteed that the UNIX
domain socket is available, otherwise, if qemu is started right after
it, it might fail to connect.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
Multiple arrays, one for each address, were needed with a single
fprintf(). Now that it's replaced by info(), we can have just one
for each protocol version.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
With glibc, we can't reliably build a static binary with
getprotobynumber(), which is currently used with -DDEBUG.
Replace that with a small array of protocol strings.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
This is in preparation for scatter-gather IO on the UDP receive path:
save a getsockname() syscall by setting a flag if we get the numbering
of all bound sockets in a strict sequence (expected, in practice) and
repurpose the tap buffer to be also a socket receive buffer, passing
it down to protocol handlers.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|