passt, branch 2024_07_26.57a21d2

tap: Improve handling of partially received frames on qemu socket

2024-07-26T12:07:42+00:00

Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If a the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tap: Correctly handle frames of odd length

2024-07-26T12:07:38+00:00

The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tap: Don't use EPOLLET on Qemu sockets

2024-07-26T12:07:20+00:00

Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket.  It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling.  That consideration
doesn't apply to the way we handle the qemu socket however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
   we *do* set it when using pasta mode with --fd.  This inconsistency
   doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available
   is read (otherwise we may have data in the buffer but never get an event
   causing us to read it).  We do that with a rather ugly goto.

   Worse, our condition for that goto appears to be incorrect.  We'll only
   loop if rem is non-zero, which will only happen if we perform a blocking
   recv() for a partially received frame.  We'll only perform that second
   recv() if the original recv() resulted in a partially read frame.  As
   far as I can tell the original recv() could end on a frame boundary
   (never triggering the second recv()) even if there is additional data in
   the socket buffer.  In that circumstance we wouldn't goto redo and could
   leave unprocessed frames in the qemu socket buffer indefinitely.

   This doesn't seem to have caused any problems in practice, but since
   there's no obvious reason to use EPOLLET here anyway, we might as well
   get rid of it.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tap: Don't attempt to carry on if we get a bad frame length from qemu

2024-07-26T12:05:44+00:00

If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on.  That sounds sensible on first blush, but
probably isn't wise in practice.  If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket.  Or more likely (c), we have a bug elswhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.

Neither (b) nor (c) is really salvageable with the same stream.  Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.

So, instead of just skipping the frame and trying to carry on, log an error
and close the socket.  As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.

Signed-off-by: David Gibson 
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio

tap: Better report errors receiving from QEMU socket

2024-07-26T11:51:23+00:00

If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error.  Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we realy should report
something for debugability.

Add an error message in this case.  Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).

Signed-off-by: David Gibson 
[sbrivio: Change error message: it's not necessarily QEMU, and mention
 that we are resetting the connection]
Signed-off-by: Stefano Brivio

log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME

2024-07-26T11:47:36+00:00

We report relative timestamps in logs, so we want to avoid jumps in
the system time.

Suggested-by: David Gibson 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

log: Initialise timestamp for relative log time also if we use a log file

2024-07-26T11:47:34+00:00

...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

log, util: Fix sub-second part in relative log time calculation

2024-07-26T11:43:19+00:00

For some reason, in commit 01efc71ddd25 ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not for
accounting for the fractional part.

Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.

To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.

Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

test/lib/perf_report: Fix highlight

2024-07-25T15:56:47+00:00

Signed-off-by: Stefano Brivio

test: Fix spurious test failure with systemd-resolved

2024-07-25T10:39:00+00:00

systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53.  Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case.  We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address.  However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.

Correct this with a different test condition.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio