| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Linux ICMP "ping" sockets are very specific in what they do. They let
userspace send ping requests (ICMP_ECHO or ICMP6_ECHO_REQUEST), and receive
matching replies (ICMP_ECHOREPLY or ICMP6_ECHO_REPLY). They don't let you
intercept or handle incoming ping requests.
In the case of passt/pasta that means we can process echo requests from tap
and forward them to a ping socket, then take the replies from the ping
socket and forward them to tap. We can't do the reverse: take echo
requests from the host and somehow forward them to the guest. There's
really no way for something outside to initiate a ping to a passt/pasta
connected guest and if there was we'd need an entirely different mechanism
to handle it.
However, we have some logic to deal with packets going in that reverse
direction. Remove it, since it can't ever be used that way. While we're
there use defines for the ICMPv6 types, instead of open coded type values.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
We initialise the address portion of the sockaddr for sendto() to the
unspecified address, but then always overwrite it with the actual
destination address before we call the sendto().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We set the port to the ICMP id on the sendto() address when using ICMP
ping sockets. However, this has no effect: the ICMP id the kernel
uses is determined only by the "port" on the socket's *bound* address
(which is constructed inside sock_l4(), using the id we also pass to
it).
For unclear reasons this change triggers cppcheck 2.13.0 to give new
"variable could be const pointer" warnings, so make *ih const as well to
fix that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table. Doing this
sometimes requires moving an entry when one is freed. That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.
Implement a new way of managing the flow table that doesn't ever move
entries. It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers. Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers. If true, this indicates that the flow
code itself should free the flow.
This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections. This duplication isn't ideal and will get worse as we
add more protocols to the flow table. It also makes it harder to
experiment with different ways of handling flow table allocation.
Instead, introduce a function to allocate a new flow: flow_alloc(). In
some cases we currently check if we're able to allocate, but delay the
actual allocation. We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation. We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'. The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.
Move flow_count to be a regular global to match. For now it needs to be
public, rather than static, but we expect to be able to change that in
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
flow_log_() is a very basic widely used function that many other functions
in flow.c will end up needing. At present it's below flow_table_compact()
which happens not to need it, but that's likely to change. Move it to
near the top of flow.c to avoid forward declarations.
Code motion only, no changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection. This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function. We can avoid this by adding a new
epoll type and dispatching directly to the right thing.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of
EPOLL_TYPE_MAX, which is a little bit safer and clearer. Add a static
assert on the size of the matching names array.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection. More generally, other flow types might want to
run similar timers in future.
We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler(). We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.
So, extend flow_defer_handler() to do this. For now we use the same timer
interval for all flow types (1s). We can make that more flexible in future
if we need to.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection. When we add other protocols to
the flow table, they're likely to want some similar scanning. It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.
To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing. Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections. tcp_splice_timer()
expires spliced connections and then does some other processing.
However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections. So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
In a number of places we pass around a struct timespec representing the
(more or less) current time. Sometimes we call it 'now', and sometimes we
call it 'ts'. Standardise on the more informative 'now'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
flow_table.h, the lower level flow header relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included
before flow_table.h.
That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them, which means the
protocol specific .c files would need to include tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h every time,
which is weird.
So, although we *mostly* lean towards the include style where .c files need
to handle the include dependencies, in this case it makes more sense to
have flow_table.h include all the protocol specific headers it needs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
pif_name() has no current callers, although we expect some as we expand the
flow table support. I'm not sure why this didn't get caught by one of
our static checkers earlier, but it's now causing cppcheck failures for me.
Add a cppcheck suppression.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
f0ccca74 ("test: make passt.mbuto script more robust") is supposed to make
mbuto more robust by standardizing on always putting things in /usr/sbin
with /sbin a symlink to it. This matters because different distros have
different conventions about how the two are used.
However, the logic there requires that /usr/sbin at least exists to start
with. This isn't always the case with Fedora derived mbuto images.
Ironically the DIRS variable ensures that /sbin exists, although we then
remove it, but doesn't require /usr/sbin to exist. Fix that up so that
the new logic will work with Fedora.
Fixes: f0ccca741f64 ("test: make passt.mbuto script more robust")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This happened in most cases implicitly before commit eff3bcb24547
("netlink: Split nl_addr() into separate operation functions"): while
going through results from netlink, we would only copy an address
into the provided return buffer if no address had been picked yet.
Because of the insertion logic in the kernel (ipv6_link_dev_addr()),
the first returned address would also be the one added last, and, in
case of a Linux guest using a DHCPv6 client as well as SLAAC, that
would be the address assigned via DHCPv6, because SLAAC happens
before the DHCPv6 exchange.
The effect of, instead, picking the last returned address (first
assigned) is visible when passt or pasta runs nested, given that, by
default, they advertise a prefix for SLAAC usage, plus an address via
DHCPv6.
The first level (L1 guest) would get a /64 address by means of SLAAC,
and a /128 address via DHCPv6, the latter matching the address on the
host.
The second level (L2 guest) would also get two addresses: a /64 via
SLAAC (same prefix as the host), and a /128 via DHCPv6, matching the
the L1 SLAAC-assigned address, not the one obtained via DHCPv6. That
is, none of the L2 addresses would match the address on the host. The
whole point of having a DHCPv6 server is to avoid (implicit) NAT when
possible, though.
Fix this in a more explicit way than the behaviour we initially had:
pick the first address among the set of most specific ones, by
comparing prefix lengths. Do this for IPv4 and for link-local
addresses, too, to match in any case the implementation of the
default source address selection.
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Fixes: eff3bcb24547 ("netlink: Split nl_addr() into separate operation functions")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
| |
Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
| |
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Creation of a symbolic link from /sbin to /usr/sbin fails if /sbin
exists and is non-empty. This is the case on Ubuntu-23.04.
We fix this by removing /sbin before creating the link.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
e5eefe77435a ("tcp: Refactor to use events instead of states, split out
spliced implementation") has exported tcp_sock_set_bufsize() to
be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use
in tcp_splice.c, so we can set it static again.
Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
sock_l4() takes NULL for ifname if you don't want to bind the socket to a
particular interface. However, for a number of the callers, it's more
natural to use an empty string for that case. Change sock_l4() to accept
either NULL or an empty string equivalently, and simplify some callers
using that change.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The
former is just a type alias to a 32-bit integer, so doesn't really give us
any type checking. Therefore we generally prefer the structure, since we
mostly want to treat IP address as opaque objects. Fix a few places where
we still use in_addr_t, but can just as easily use struct in_addr.
Note there are still some uses of in_addr_t in conf.c, but those are
justified: since they're doing prefix calculations, they actually need to
look at the internals of the address as an integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
We go to some trouble, if the configured output address is unspecified, to
pass NULL to sock_l4(). But while passing NULL is one way to get sock_l4()
not to specify a bind address, passing the "any" address explicitly works
too. Use this to simplify icmp_tap_handler().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original commit message says:
---
Currently we initialise the address field of the sockaddrs we construct
to the any/unspecified address, but not in a very clear way: we use
explicit 0 values, which is only interpretable if you know the order of
fields in the sockaddr structures. Use explicit field names, and explicit
initialiser macros for the address.
Because we initialise to this default value, we don't need to explicitly
set the any/unspecified address later on if the caller didn't pass an
overriding bind address.
---
and the original patch modified the initialisation of addr4 and
addr6:
- instead of { 0 }, { 0 } for sin_addr and sin_zero,
.sin_addr = IN4ADDR_ANY_INIT
- instead of 0, IN6ADDR_ANY_INIT, 0:
.sin6_addr = IN6ADDR_ANY_INIT
but I dropped those hunks: they break gcc versions 7 to 9 as reported
in eed6933e6c29 ("udp: Explicitly initialise sin6_scope_id and
sin_zero in sockaddr_in{,6}").
I applied the rest of the changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Dropped first two hunks]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
We might as well when we're passing a known constant value, giving the
compiler the best chance to optimise things away.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address, make a similar one for the unspecified / any address.
This avoids messying things with the internal structure of struct in_addr
where we don't care about it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address without delving into its internals. However there are
some places we don't use it, and explicitly look at the internal structure
of struct in_addr, which we generally want to avoid. Use the define more
widely to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This takes a struct in_addr * (i.e. an IPv4 address), although it's
explicitly supposed to handle IPv6 as well. Both its caller and sock_l4()
which it calls use a void * for the address, which can be either an in_addr
or an in6_addr.
We get away with this, because we don't do anything with the pointer other
than transfer it from the caller to sock_l4(), but it's misleading. And
quite possibly technically UB, because C is like that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
In most places where we need to get ICMP definitions, we get them from
<netinet/ip_icmp.h>. However in checksum.c we instead include
<linux/icmp.h>. Change it to use <netinet/ip_icmp.h> for consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently tcp_hash() returns the hash bucket for a value, that is the hash
modulo the size of the hash table. Usually it's a bit more flexible to
have hash functions return a "raw" hash value and perform the modulus in
the callers. That allows the same hash function to be used for multiple
tables of different sizes, or to re-use the hash for other purposes.
We don't do anything like that with tcp_hash() at present, but we have some
plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash()
return raw hash values.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We implement our hash table with pointers to the entry for each bucket (or
NULL). However, the entries are always allocated within the flow table,
meaning that a flow index will suffice, halving the size of the hash table.
For TCP, just a flow index would be enough, but future uses will want to
expand the hash table to cover indexing either side of a flow, so use a
flow_sidx_t as the type for each hash bucket.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.
That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing. Here if a hash bucket is occupied
we just move onto the next one until we find a feww one. This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link. It does require some
additional complexity for hash removal.
This approach can perform poorly with hash table load is high. However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load. It's relatively cheap to decrease
that bound if we find we need to.
I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm
R in Section 6.4. Note that there is an error in Algorithm R as printed,
see errata at [0].
[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_hash_lookup() expects the port numbers in host order, but the TCP
header, of course, has them in network order, so we need to switch them.
However we call htons() (host to network) instead of ntohs() (network to
host). This works because those do the same thing in practice (they only
wouldn't on very strange theoretical platforms which are neither big nor
little endian).
But, having this the "wrong" way around is misleading, so switch it around.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
It's been a while -- there are now official packages for Arch Linux,
Gentoo, Void Linux.
Suggested-by: Rahil Bhimjiani <me@rahil.website>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
On x32, glibc defines time_t and suseconds_t (the latter, also known as
__syscall_slong_t) as unsigned long long, whereas "everywhere else",
including x86_64 and i686, those are unsigned long.
See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for
all the gory details.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
| |
Reported-by: lemmi <lemmi@nerd2nerd.org>
Link: https://github.com/void-linux/void-packages/actions/runs/7097192513/job/19316903568
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we run passt nested (a guest connected via passt to a guest
connected via passt to the host), the first guest (L1) typically has
two IPv6 addresses on the same interface: one formed from the prefix
assigned via SLAAC, and another one assigned via DHCPv6 (to match the
address on the host).
When we select addresses for comparison, in this case, we have
multiple global unicast addresses -- again, on the same interface.
Selecting the first reported one on both host and guest is not
entirely correct (in theory, the order might differ), but works
reasonably well.
Use the trick from 5beef085978e ("test: Only select a single
interface or gateway in tests") to ask jq(1) for the first address
returned by the query.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, we have no mechanism to dynamically update IPv6
addressing, routing or DNS information (which should eventually be
implemented via netlink monitor), so it makes no sense to limit
lifetimes of NDP information to any particular value.
If we do, with common configurations of systemd-networkd in a guest,
we can end up in a situation where we have a /128 address assigned
via DHCPv6, the NDP-assigned prefix expires, and the default route
also expires. However, as there's a valid address, the prefix is
not renewed. As a result, the default route becomes invalid and we
lose it altogether, which implies that the guest loses IPv6
connectivity except for link-local communication.
Set the router lifetime to the maximum allowed by RFC 8319, that is,
65535 seconds (about 18 hours). RFC 4861 limited this value to 9000
seconds, but RFC 8319 later updated this limit.
Set prefix and DNS information lifetime to infinity. This is allowed
by RFC 4861 and RFC 8319.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using the old-style "pane" methods of executing commands during the
tests, we need to scan the shell output for prompts in order to tell when
commands have finished. This is inherently unreliable because commands
could output things that look like prompts, and prompts might not look like
we expect them to. The only way to really fix this is to use a better way
of dispatching commands, like the newer "context" system.
However, it's awkward to convert everything to "context" right at the
moment, so we're still relying on some tests that do work most of the time.
It is, however, particularly sensitive to fancy coloured prompts using
escape sequences. Currently we try to handle this by stripping actual
ESC characters with tr, then looking for some common variants.
We can do a bit better: instead strip all escape sequences using sed before
looking for our prompt. Or, at least, any one using [a-zA-Z] as the
terminating character. Strictly speaking ANSI escapes can be terminated by
any character in 0x40..0x7e, which isn't easily expressed in a regexp.
This should capture all common ones, though.
With this transformation we can simplify the list of patterns we then look
for as a prompt, removing some redundant variants.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a TCP connection is closed, we mark it by setting events to CLOSED,
then some time later we do final cleanups: closing sockets, removing from
the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any
apparent matches that are CLOSED, since they represent a stale connection.
This can happen in practice if one connection closes and a new one with the
same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when
we extend the hash table to more general flows. So, alter the code to
immediately remove the connection from the hash table when CLOSED, although
we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.
We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret. So move the hash_secret
to the general context structure.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Future debugging will want to identify a specific passt interface. We make
a distinction in these helpers between the name of the *type* of pif, and
name of the pif itself. For the time being these are always the same
thing, since we have at most instance of each type of pif. However, that
might change in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In test/prepare-distro-img.sh we use guestfish to tweak our distro guest
images to be suitable. Part of this is using a 'copy-in' directive to copy
in the source files for passt itself. Currently we copy in all the files
with a single 'copy-in', since it allows listing multiple files. However
it turns out that the number of arguments it can accept is fairly limited
and our current list of files is already very close to that limit.
Instead, expand our list of files to one copy-in per file, avoiding that
limitation. This isn't much slower, because all the commands still run in
a single invocation of guestfish itself.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).
This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections. In that case we
want to know which side of the connection the event is occurring on as
well as which connection. At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.
When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.
Therefore add a new 'flowside' epoll_ref field, with exactly that
information. We use it for TCP connected sockets. This allows us to
directly know the side for spliced connections. For "tap"
connections, it's pretty meaningless, since the side is always the
socket side. It still makes logical sense though, and it may become
important for future flow table work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, we use 'int' values to represent the "side" of a connection,
which must always be 0 or 1. This turns out to be dangerous.
In some cases we're going to want to put the side into a 1-bit bitfield.
However, if that bitfield has type 'int', when we copy it out to a regular
'int' variable, it will be sign-extended and so have values 0 and -1,
instead of 0 and 1.
To avoid this, always use unsigned variables for the side.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets. Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with. As we expand the use of the flow table, we expect
that to be true for more epoll fds. So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|