| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Apart from which mh array they're operating on the recvmmsg() calls in
udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are
some of the control structure updates.
By using some local variables to refer to the IP version specific control
arrays, make some more logic common between the IPv4 and IPv6 paths. As
well as slightly reducing the code size, this makes it less likely that
we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we
did in a recently fixed bug.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path. In
fact this is harmless because this assignment is redundant (the 0th entry
msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and
won't change).
There is also an incorrect usage of udp6_l2_mh_tap[] in
udp_sock_fill_data_v4. This one can cause real problems, because we'll
use stale iov_len values if we send multiple messages to the qemu socket.
Most of the time that will be relatively harmless - we're likely to either
drop UDP packets, or send duplicates. However, if the stale iov_len we
use ends up referencing an uninitialized buffer we could desynchronize the
qemu stream socket.
Correct both these bugs. The UDP6 path appears to be correct, but it does
have some comments that incorrectly reference the IPv4 versions, so fix
those as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
udp_sock_handler_splice() reads a whole batch of datagrams at once with
recvmmsg(). It then forwards them all via a single socket on the other
side, based on the source port.
However, it's entirely possible that the datagrams in the set have
different source ports, and thus ought to be forwarded via different
sockets on the destination side. In fact this situation arises with the
iperf -P4 throughput tests in our own test suite. AFAICT we only get away
with this because iperf3 is strictly one way and doesn't send reply packets
which would be misdirected because of the incorrect source ports.
Alter udp_sock_handler_splice() to split the packets it receives into
batches with the same source address and send each batch with a separate
sendmmsg().
For now we only look for already contiguous batches, which means that if
there are multiple active flows interleaved this is likely to degenerate
to batches of size 1. For now this is the simplest way to correct the
behaviour and we can try to optimize later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Move the part of udp_sock_handler_splice() concerned with sending out the
datagrams into a new udp_splice_sendfrom() helper. This will make later
cleanups easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We maintain a set of buffers for UDP packets to be forwarded via the tap
interface in udp[46]_l2_buf. We then have a separate set of buffers for
packets to be "spliced" in udp_splice_buf[]. However, we only use one of
these at a time, so we can share the buffer space.
For the receiving splice packets we can not only re-use the data buffers
but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures.
For sending the splice packets we keep the same data buffers, but we need
specific control structures. We create udp[46]_iov_splice - we can't
reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing
spliced packets, but the tap path expects iov_len to remain the same (it
only uses it for receive). Likewise we create udp[46]_mh_splice with the
mmsghdr structures for sending spliced packets. As well as needing to
reference different iovs, these need to all reference udp_splice_namebuf
instead of individual msg_name fields for each slot.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
udp_sock_handler_splice() has a somewhat clunky if to extract the port from
a socket address which could be either IPv4 or IPv6. Future changes are
going to make this even more clunky, so introduce a helper function to
do this extraction.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
These two constants have the same value, and there's not a lot of reason
they'd ever need to be different. Future changes will further integrate
the spliced and "tap" paths so that these need to be the same. So, merge
them into UDP_MAX_FRAMES.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
Previous cleanups mean that we can now rework some complex ifs in
udp_sock_handler_splice() into a simpler set.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A UDP pseudo-connection between port A in the init namespace and port B in
the pasta guest namespace involves two sockets: udp_splice_init[v6][B]
and udp_splice_ns[v6][A]. The socket which originated this "connection"
will be permanent but the other one will be closed on a timeout.
When we get a packet from the originating socket, we update the timeout on
the other socket, but we don't do the same when we get a reply packet from
the other socket. However any activity on the "connection" probably
indicates that it's still in use. Without this we could incorrectly time
out a "connection" if it's using a protocol which involves a single
initiating packet, but which then gets continuing replies from the target.
Correct this by updating the timeout on both sockets for a packet in either
direction. This also updates the timestamps for the permanent originating
sockets which is unnecessary, but harmless.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice()
we're finding the socket on which the originating packet for the
"connection" was received on. However, we don't specifically need this
socket to be the originating one - we just need one that's bound to the
the source port of this reply packet in the init namespace. We can look
this up in udp_splice_to_init[v6][src].target_sock, whose defining
characteristic is exactly that. The same applies with init and ns swapped.
In practice, of course, the port we locate this way will always be the
originating port, since we couldn't have started this "connection" if it
wasn't.
Change this, and we no longer need the @orig_sock field at all. That
leaves just @target_sock which we rename to simply @sock. The whole
udp_splice_flow structure now more represents a single bound port than
a "flow" per se, so rename and recomment it accordingly. Likewise the
udp_splice_to_{ns,init} names are now misleading, since the ports in
those maps are used in both directions. Rename them to
udp_splice_{ns,init} indicating the location where the described
socket is bound.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we look up udp_splice_to_ns[v6][src].target_sock in
udp_sock_handler_splice, all we really require of the socket is that it
be bound to port src in the pasta guest namespace. Similarly for
udp_splice_to_init but bound in the init namespace.
Usually these sockets are created temporarily by udp_splice_connect() and
cleaned up by udp_timer(). However, depending on the -u and -U options its
possible we have a permanent socket bound to the relevant port created by
udp_sock_init(). If such a socket exists, we could use it instead of
creating a temporary one. In fact we *must* use it, because we'll fail
trying to bind() a temporary one to the same port.
So allow this, store permanently bound sockets into udp_splice_to_{ns,init}
in udp_sock_init(). These won't get incorrectly removed by the timer
because we don't put a corresponding entry in the udp_act[] structure
which directs the timer what to clean up.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For each IP version udp_socket() has 3 possible calls to sock_l4(). One
is for the "non-spliced" bound socket in the init namespace, one for the
"spliced" bound socket in the init namespace and one for the "spliced"
bound socket in the pasta namespace.
However when this is called to create a socket in the pasta namspeace there
is a logic error which causes it to take the path for the init side spliced
socket as well as the ns socket. This essentially tries to create two
identical sockets on the ns side. Unsurprisingly the second bind() call
fails according to strace.
Correct this to only attempt to open one socket within the ns.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The @splice field in union udp_epoll_ref can have a number of values for
different types of "spliced" packet flows. Split it into several single
bit fields with more or less independent meanings. The new @splice field
is just a boolean indicating whether the socket is associated with a
spliced flow, making it identical to the @splice fiend in tcp_epoll_ref.
The new bit @orig, indicates whether this is a socket which can originate
new udp packet flows (created with -u or -U) or a socket created on the
fly to handle reply socket. @ns indicates whether the socket lives in the
init namespace or the pasta namespace.
Making these bits more orthogonal to each other will simplify some future
cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
We set this field, but nothing ever checked it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we connect() the socket we use to forward spliced UDP flows.
However, we now only ever use sendto() rather than send() on this socket
so there's not actually any need to connect it. Don't do so.
Rename a number of things that referred to "connect" or "conn" since that
would now be misleading.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
udp_sock_handler_splice() has two different ways of sending out packets
once it has determined the correct destination socket. For the originating
sockets (which are not connected) it uses sendto() to specify a specific
address. For the forward socket (which is connected) we use send().
However we know the correct destination address even for the forward socket
we do also know the correct destination address. We can use this to use
sendto() instead of send(), removing the need for two different paths and
some staging data structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Each entry udp_splice_map[v6][N] keeps information about two essentially
unrelated packet flows. @ns_conn_sock, @ns_conn_ts and @init_bound_sock
track a packet flow from port N in the host init namespace to some other
port in the pasta namespace (the one @ns_conn_sock is connected to).
@init_conn_sock, @init_conn_ts and @ns_bound_sock track packet flow from
port N in the pasta namespace to some other port in the host init namespace
(the one @init_conn_sock is connected to).
Split udp_splice_map[][] into two separate tables for the two directions.
Each entry in each table is a 'struct udp_splice_flow' with @orig_sock
(previously the bound socket), @target_sock (previously the connected
socket) and @ts (the timeout for the target socket).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
pasta handles "spliced" port forwarding by resending datagrams received on
a bound socket in the init namespace to a connected socket in the guest
namespace. This means there are actually three ports associated with each
"connection". First there's the source and destination ports of the
originating datagram. That's also the destination port of the forwarded
datagram, but the source port of the forwarded datagram is the kernel
allocated bound address of the connected socket.
However, by bind()ing as well as connect()ing the forwarding socket we can
choose the source port of the forwarded datagrams. By choosing it to match
the original source port we remove that surprising third port number and
no longer need to store port numbers in struct udp_splice_port.
As a bonus this means that the recipient of the packets will see the
original source port if they call getpeername(). This rarely matters, but
it can't hurt.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the case where the client writes a packet and then closes the
socket, because we receive EPOLLIN|EPOLLRDHUP together we have a
choice of whether to close the socket immediately, or read the packet
and then close the socket. Choose the latter.
This should improve fuzzing coverage and arguably is a better choice
even for regular use since dropping packets on close is bad.
See-also: https://archives.passt.top/passt-dev/20221117171805.3746f53a@elisabeth/
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
This passes a fully connected stream socket to passt.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
[sbrivio: reuse fd_tap instead of adding a new descriptor,
imply --one-off on --fd, add to optstring and usage()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
These files are left around by emacs amongst other editors.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you run the build several times it will fail unnecessarily with:
ln -s passt pasta
ln: failed to create symbolic link 'pasta': File exists
make: *** [Makefile:134: pasta] Error 1
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
The pointers are actually the same, but we later pass the container
union to tcp_table_compact(), which might zero the size of the whole
union, and this confuses Coverity Scan.
Given that we have pointers to the container union to start with,
just pass those instead, all the way down to tcp_table_compact().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections. By doing this we halve the number of
listening sockets we need for TCP (assuming passt/pasta is listening on the
same ports for IPv4 and IPv6). When forwarding many ports (e.g. -t all)
this can significantly reduce the amount of kernel memory that passt
consumes.
When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all
-u all), this reduces kernel memory usage from ~677MiB to ~487MiB
(kernel version 6.0.8 on Fedora 37, x86_64).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
According to its doc comments, sock_l4() returns -1 on error. It does,
except in one case where it returns -EIO. Fix this inconsistency to match
the docs and always return -1.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, when instructed to open an IPv6 socket, sock_l4() explicitly
sets the IPV6_V6ONLY socket option so that the socket will only respond to
IPv6 connections. Linux (and probably other platforms) allow "dual stack"
sockets: IPv6 sockets which can also accept IPv4 connections.
Extend sock_l4() to be able to make such sockets, by passing AF_UNSPEC as
the address family and no bind address (binding to a specific address would
defeat the purpose). We add a Makefile define 'DUAL_STACK_SOCKETS' to
indicate availability of this feature on the target platform.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Previous cleanups mean that tcp_sock_init4() and tcp_sock_init6() are
almost identical, and the remaining differences can be easily
parameterized. Combine both into a single tcp_sock_init_af() function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
same as the corresponding IPv4 addresses. However currently we won't
splice a connection from ::ffff:127.0.0.1 the way we would one from
127.0.0.1. Correct this so that we can splice connections from IPv4
localhost that have been received on an IPv6 dual stack socket.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
passt usually doesn't NAT, but it does do so for the remapping of the
gateway address to refer to the host. Currently we perform this NAT with
slightly different rules on both IPv4 addresses and IPv6 addresses, but not
on IPv4-mapped IPv6 addresses. This means we won't correctly handle the
case of an IPv4 connection over an IPv6 socket, which is possible on Linux
(and probably other platforms).
Refactor tcp_conn_from_sock() to perform the NAT after converting either
address family into an inany_addr, so IPv4 and and IPv4-mapped addresses
have the same representation.
With two new helpers this lets us remove the IPv4 and IPv6 specific paths
from tcp_conn_from_sock().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bit in the TCP specific epoll reference indicates whether the
connection is IPv6 or IPv4. However the sites which refer to it are
already calling accept() which (optionally) returns an address for the
remote end of the connection. We can use the sa_family field in that
address to determine the connection type independent of the epoll
reference.
This does have a cost: for the spliced case, it means we now need to get
that address from accept() which introduces an extran copy_to_user().
However, in future we want to allow handling IPv4 connectons through IPv6
sockets, which means we won't be able to determine the IP version at the
time we create the listening socket and epoll reference. So, at some point
we'll have to pay this cost anyway.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It looks like tcp_seq_init() is supposed to advance the sequence number
by one every 32ns. However we only right shift the ns part of the timespec
not the seconds part, meaning that we'll advance by an extra 32 steps on
each second.
I don't know if that's exploitable in any way, but it doesn't appear to be
the intent, nor what RFC 6528 suggests.
In addition, we convert from seconds to nanoseconds with a multiplication
by '1E9'. In C '1E9' is a floating point constant, forcing a conversion
to floating point and back for what should be an integer calculation
(confirmed with objdump and Makefile default compiler flags). Spell out
1000000000 in full to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_seq_init() takes a number of parameters for the connection, but at
every call site, these are already populated in the tcp_conn structure.
Likewise we always store the result into the @seq_to_tap field.
Use this to simplify tcp_seq_init().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_seq_init() has separate paths for IPv4 and IPv6 addresses, which means
we will calculate different sequence numbers for IPv4 and equivalent
IPv4-mapped IPv6 addresses.
Change it to treat these the same by always converting the input address
into an inany_addr representation and use that to calculate the sequence
number.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_hash_match() can take either an IPv4 (struct in_addr) or IPv6 (struct
in6_addr) address. It has two different paths for each of those cases.
However, its only caller has already constructed an equivalent inany
representation of the address, so we can have tcp_hash_match take that
directly and use a simpler comparison with the inany_equals() helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_hash_insert() takes an address to control which hash bucket the
connection will go into. However, an inany_addr representation of that
address is already stored in struct tcp_conn.
Now that we've made the hashing of IPv4 and IPv4-mapped IPv6 addresses
equivalent, we can simplify tcp_hash_insert() to use the address in
struct tcp_conn, rather than taking it as an extra parameter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the tcp_conn structure, we represent the address with an inany_addr
which could be an IPv4 or IPv6 address. However, we have different paths
which will calculate different hashes for IPv4 and equivalent IPv4-mapped
IPv6 addresses. This will cause problems for some future changes.
Make the hash function work the same for these two cases, by taking an
inany_addr directly. Since this represents IPv4 and IPv4-mapped IPv6
addresses the same way, it will trivially hash the same for both cases.
Callers are changed to construct an inany_addr from whatever they have.
Some of that will be elided in later changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union. We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.
This approach is useful wider than the specific place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h. Along with
that create a number of helper functions to make working with these "inany"
addresses easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently when we insert a connection into the hash table, we store its
bucket number so we can find it when removing entries. However, we can
recompute the hash value from other contents of the structure so we don't
need to store it. This brings the size of tcp_tap_conn down to 64 bytes
again, which means it will fit in a single cacheline on common machines.
This change also removes a non-obvious constraint that the hash table have
less than twice TCP_MAX_CONNS buckets, because of the way
TCP_HASH_BUCKET_BITS was constructed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the epoll reference for tcp sockets includes a bit indicating
whether the socket maps to a spliced connection. However, the reference
also has the index of the connection structure which also indicates whether
it is spliced. We can therefore avoid the splice bit in the epoll_ref by
unifying the first part of the non-spliced and spliced handlers where we
look up the connection state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address). This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::). This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.
However, it turns out we don't need to do this. We can splice a connection
if and only if it originates from the loopback address. Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback. Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else. However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.
Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_sock_init*() can create either sockets listening on the host, or in
the pasta network namespace (with @ns==1). There are, however, a number
of differences in how these two cases work in practice though. "ns"
sockets are only used in pasta mode, and they always lead to spliced
connections only. The functions are also only ever called in "ns" mode
with a NULL address and interface name, and it doesn't really make sense
for them to be called any other way.
Later changes will introduce further differences in behaviour between these
two cases, so it makes more sense to use separate functions for creating
the ns listening sockets than the regular external/host listening sockets.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures. However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.
Simplify things slightly by moving this bit into the common header of the
two structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures. Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules). This
will make subsequent changes a bit safer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|