| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack. Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards. Setting seq_ack_from_tap is
not correct, though.
In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above. However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible. In
that case this update will cause problems.
We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed). We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest. However tcp_sock_consume() failing means we weren't
able to disard the acked data. To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough. Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table. Doing this
sometimes requires moving an entry when one is freed. That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.
Implement a new way of managing the flow table that doesn't ever move
entries. It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers. Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers. If true, this indicates that the flow
code itself should free the flow.
This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections. This duplication isn't ideal and will get worse as we
add more protocols to the flow table. It also makes it harder to
experiment with different ways of handling flow table allocation.
Instead, introduce a function to allocate a new flow: flow_alloc(). In
some cases we currently check if we're able to allocate, but delay the
actual allocation. We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation. We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'. The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.
Move flow_count to be a regular global to match. For now it needs to be
public, rather than static, but we expect to be able to change that in
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection. This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function. We can avoid this by adding a new
epoll type and dispatching directly to the right thing.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection. More generally, other flow types might want to
run similar timers in future.
We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler(). We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.
So, extend flow_defer_handler() to do this. For now we use the same timer
interval for all flow types (1s). We can make that more flexible in future
if we need to.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection. When we add other protocols to
the flow table, they're likely to want some similar scanning. It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.
To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing. Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections. tcp_splice_timer()
expires spliced connections and then does some other processing.
However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections. So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
In a number of places we pass around a struct timespec representing the
(more or less) current time. Sometimes we call it 'now', and sometimes we
call it 'ts'. Standardise on the more informative 'now'.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
flow_table.h, the lower level flow header relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h. It doesn't include it itself, so tcp_conn.h must be included
before flow_table.h.
That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them, which means the
protocol specific .c files would need to include tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h every time,
which is weird.
So, although we *mostly* lean towards the include style where .c files need
to handle the include dependencies, in this case it makes more sense to
have flow_table.h include all the protocol specific headers it needs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
e5eefe77435a ("tcp: Refactor to use events instead of states, split out
spliced implementation") has exported tcp_sock_set_bufsize() to
be able to use it in tcp_splice.c, but 6ccab72d9b40 has removed its use
in tcp_splice.c, so we can set it static again.
Fixes: 6ccab72d9b40 ("tcp: Improve handling of fallback if socket pool is empty on new splice")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address without delving into its internals. However there are
some places we don't use it, and explicitly look at the internal structure
of struct in_addr, which we generally want to avoid. Use the define more
widely to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This takes a struct in_addr * (i.e. an IPv4 address), although it's
explicitly supposed to handle IPv6 as well. Both its caller and sock_l4()
which it calls use a void * for the address, which can be either an in_addr
or an in6_addr.
We get away with this, because we don't do anything with the pointer other
than transfer it from the caller to sock_l4(), but it's misleading. And
quite possibly technically UB, because C is like that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently tcp_hash() returns the hash bucket for a value, that is the hash
modulo the size of the hash table. Usually it's a bit more flexible to
have hash functions return a "raw" hash value and perform the modulus in
the callers. That allows the same hash function to be used for multiple
tables of different sizes, or to re-use the hash for other purposes.
We don't do anything like that with tcp_hash() at present, but we have some
plans to do so. Prepare for that by making tcp_hash() and tcp_conn_hash()
return raw hash values.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We implement our hash table with pointers to the entry for each bucket (or
NULL). However, the entries are always allocated within the flow table,
meaning that a flow index will suffice, halving the size of the hash table.
For TCP, just a flow index would be enough, but future uses will want to
expand the hash table to cover indexing either side of a flow, so use a
flow_sidx_t as the type for each hash bucket.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.
That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing. Here if a hash bucket is occupied
we just move onto the next one until we find a feww one. This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link. It does require some
additional complexity for hash removal.
This approach can perform poorly with hash table load is high. However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load. It's relatively cheap to decrease
that bound if we find we need to.
I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition. Specifically Algorithm L and Algorithm
R in Section 6.4. Note that there is an error in Algorithm R as printed,
see errata at [0].
[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_hash_lookup() expects the port numbers in host order, but the TCP
header, of course, has them in network order, so we need to switch them.
However we call htons() (host to network) instead of ntohs() (network to
host). This works because those do the same thing in practice (they only
wouldn't on very strange theoretical platforms which are neither big nor
little endian).
But, having this the "wrong" way around is misleading, so switch it around.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
On x32, glibc defines time_t and suseconds_t (the latter, also known as
__syscall_slong_t) as unsigned long long, whereas "everywhere else",
including x86_64 and i686, those are unsigned long.
See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for
all the gory details.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a TCP connection is closed, we mark it by setting events to CLOSED,
then some time later we do final cleanups: closing sockets, removing from
the hash table and so forth.
This does mean that when making a hash lookup we need to exclude any
apparent matches that are CLOSED, since they represent a stale connection.
This can happen in practice if one connection closes and a new one with the
same endpoints is started shortly afterward.
Checking for CLOSED is quite specific to TCP however, and won't work when
we extend the hash table to more general flows. So, alter the code to
immediately remove the connection from the hash table when CLOSED, although
we still defer closing sockets and other cleanup.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.
We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret. So move the hash_secret
to the general context structure.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).
This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections. In that case we
want to know which side of the connection the event is occurring on as
well as which connection. At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.
When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.
Therefore add a new 'flowside' epoll_ref field, with exactly that
information. We use it for TCP connected sockets. This allows us to
directly know the side for spliced connections. For "tap"
connections, it's pretty meaningless, since the side is always the
socket side. It still makes logical sense though, and it may become
important for future flow table work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets. Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with. As we expand the use of the flow table, we expect
that to be true for more epoll fds. So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
In tcp_timer_handler() we use conn_at_idx() to interpret the flow index
from the epoll reference. However, this will never be NULL - we always
put a valid index into the epoll_ref. Simplify slightly based on this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the messages logged by the TCP code (be they errors, debug or
trace messages) are related to a specific connection / flow. We're fairly
consistent about prefixing these with the type of connection and the
connection / flow index. However there are a few places where we put the
index later in the message or omit it entirely. The template with the
prefix is also a little bulky to carry around for every message,
particularly for spliced connections.
To help keep this consistent, introduce some helpers to log messages
linked to a specific flow. It takes the flow as a parameter and adds a
uniform prefix to each message. This makes things slightly neater now, but
more importantly will help keep formatting consistent as we add more things
to the flow table.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed. The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow. Therefore, move it to flow.c rather than being
purportedly TCP specific.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table. We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.
They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.
In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense. tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match. That in
turn means we can remove a check for negative values from it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We want to generalise "connection" tracking to things other than true TCP
connections. Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c. The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure. We want to generalise the TCP
connection table to other types of flows in other protocols. Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In tcp_send_flag(), a4826ee04b76 has replaced:
th->doff = sizeof(*th) / 4;
th->doff += OPT_MSS_LEN / 4;
th->doff += (1 + OPT_WS_LEN) / 4;
by
optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
th->doff = (sizeof(*th) + optlen) / 4;
but forgot to remove the useless "th->doff += (1 + OPT_WS_LEN) / 4;"
Fixes: a4826ee04b76 ("tcp: Defer and coalesce all segments with no data (flags) to handler")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
| |
Types size_t and ssize_t are not necessarily long, it depends on the
architecture.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
valgrind complains if we pass a NULL buffer to recv(), even if we use
MSG_TRUNC, in which case it's actually safe. For a long time we've had
a valgrind suppression for this. It singles out the recv() in
tcp_sock_consume(), the only place we use MSG_TRUNC.
However, tcp_sock_consume() only has a single caller, which makes it a
prime candidate for inlining. If inlined, it won't appear on the stack and
valgrind won't match the correct suppression.
It appears that certain compiler versions (for example gcc-13.2.1 in
Fedora 39) will inline this function even with the -O0 we use for valgrind
builds. This breaks the suppression leading to a spurious failure in the
tests.
There's not really any way to adjust the suppression itself without making
it overly broad (we don't want to match other recv() calls). So, as a hack
explicitly prevent inlining of this function when we're making a valgrind
build. To accomplish this add an explicit -DVALGRIND when making such a
build.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_port_rebind() is desgined to be called from NS_CALL() and has two
disjoint cases: one where it enters the namespace (outbound forwards) and
one where it doesn't (inbound forwards).
We only actually need the NS_CALL() framing for the outbound case, for
inbound we can just call tcp_port_do_rebind() directly. So simplify
tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate
an awkward parameters structure.
With that done we can safely rename tcp_port_do_rebind() to
tcp_port_rebind() for brevity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
| |
tcp_port_rebind() has two cases with almost but not quite identical code.
Simplify things a bit by factoring this out into a single parameterised
helper, tcp_port_do_rebind().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising. In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.
However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets. After that commit, it does affect
established sockets but doesn't behave the way we need:
* It appears to be designed only to shrink the window, not to allow it to
re-expand.
* More importantly, that commit has a serious bug where if the
setsockopt() is made when the existing kernel advertised window for the
socket happens to be zero, it will now become locked at zero, stopping
any further data from being received on the socket.
Since this has never worked as intended, simply remove it. It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.
This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug. We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.
Fixes: d3192f67c492 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the
sock side advertised window, but it is also responsible for actually
updating the conn->wnd_from_tap value.
Rename to tcp_tap_window_update() to reflect that broader purpose, and pull
the logic that's not TCP_WINDOW_CLAMP related out to the front.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
logmsg() takes printf like arguments, but because it's not a built in, the
compiler won't generate warnings if the format string and parameters don't
match. Enable those by using the format attribute.
Strictly speaking this is a gcc extension, but I believe it is also
supported by some other common compilers. We already use some other
attributes in various places. For now, just use it and we can worry about
compilers that don't support it if it comes up.
This exposes some warnings from existing callers, both in gcc and in
clang-tidy:
- Some are straight out bugs, which we correct
- It's occasionally useful to invoke the logging functions with an empty
string, which gcc objects to, so disable that specific warning in the
Makefile
- Strictly speaking the C standard requires that the parameter for a %p
be a (void *), not some other pointer type. That's only likely to cause
problems in practice on weird architectures with different sized
representations for pointers to different types. Nonetheless add the
casts to make it happy.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Use of tcp_l2_mh has been removed in commit 38fbfdbcb95d,
but its declaration and initialization are always in the code.
Remove them as they are useless.
Fixes: 38fbfdbcb95d ("tcp: Get rid of iov with cached MSS, drop sendmmsg(), add deferred flush")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
| |
For now, packets passed to the various *_tap_handler() functions always
come from the single "tap" interface. We want to allow the possibility to
broaden that in future. As preparation for that, have the code in tap.c
pass the pif id of the originating interface to each of those handler
functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
| |
For certain socket types, we record in the epoll ref whether they're
sockets in the namespace, or on the host. We now have the notion of "pif"
to indicate what "place" a socket is associated with, so generalise the
simple one-bit 'ns' to a pif id.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
get_bound_ports_*() now only use their context and ns parameters to
determine which forwarding maps they're operating on. Each function needs
the map they're actually updating, as well as the map for the other
direction, to avoid creating forwarding loops. The UDP function also
requires the corresponding TCP map, to implement the behaviour where we
forward UDP ports of the same number as bound TCP ports for tools like
iperf3.
Passing those maps directly as parameters simplifies the code without
making the callers life harder, because those already know the relevant
maps. IMO, invoking these functions in terms of where they're looking for
updated forwarding also makes more logical sense than in terms of where
they're looking for bound ports. Given that new way of looking at the
functions, also rename them to port_fwd_scan_*().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently get_bound_ports() takes a parameter to determine if it scans for
UDP or TCP bound ports, but in fact there's almost nothing in common
between those two paths. The parameter appears primarily to have been
a convenience for when we needed to invoke this function via NS_CALL().
Now that we don't need that, split it into separate TCP and UDP versions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we want to scan for bound ports in the namespace we use NS_CALL() to
run get_bound_ports() in the namespace. However, the only thing it
actually needed to be in the namespace for was to open the /proc/net file
it was scanning. Since we now always pre-open those, we no longer need
to switch to the namespace for the actual get_bound_ports() calls.
That in turn means that tcp_port_detect() doesn't need to run in the ns
either, and we can just replace it with inline calls to get_bound_ports().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The implementation of scanning /proc files to do automatic port forwarding
is a bit awkwardly split between procfs_scan_listen() in util.c,
get_bound_ports() and related functions in conf.c and the initial setup
code in conf().
Consolidate all of this into port_fwd.h, which already has some related
definitions, and a new port_fwd.c.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).
It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.
But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.
Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It looks like we need it as workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:
- an iperf3 client on the host sends about 160 KiB, typically
segmented into five frames by passt. We read this data using
MSG_PEEK
- the iperf3 server on the guest starts receiving
- meanwhile, the host kernel advertised a zero-sized window to the
sender, as expected
- eventually, the guest acknowledges all the data sent so far, and
we drop it from the buffer, courtesy of tcp_sock_consume(), using
recv() with MSG_TRUNC
- the client, however, doesn't get an updated window value, and
even keepalive packets are answered with zero-window segments,
until the connection is closed
It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).
Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the sender.
I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).
Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|
|
|
|
|
|
|
|
|
| |
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.
Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
|