passt, branch 2025_01_21.4f2c8e7

vhost_user: Drop packet with unsupported iovec array

2025-01-21T13:30:42+00:00

If the iovec array cannot be managed, drop it rather than
passing the second entry to tap_add_packet().

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

tcp: Set PSH flag for last incoming packets in a batch

2025-01-21T13:28:44+00:00

So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.

However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.

This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:

  $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
  33170   0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
  33171   0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
  33172   0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
  33173   1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370

in this example (packet capture taken by passt), frame #33172 is a
mouse event sent by the RDP client, and frame #33173 is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame #33173.

If we set PSH, instead:

  $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
  314   0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
  315   0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
  316   0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
  317   0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53

here, in frame #316, our mouse event is delivered without a delay and
receives a response in approximately 12 ms.

Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.

Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio

tcp: Set ACK flag on all RST segments, even for client in SYN-SENT state

2025-01-21T13:28:44+00:00

Somewhat curiously, RFC 9293, section 3.10.7.3, states:

   If the state is SYN-SENT, then
   [...]

      Second, check the RST bit:
      -  If the RST bit is set,
      [...]

         o  If the ACK was acceptable, then signal to the user "error:
            connection reset", drop the segment, enter CLOSED state,
            delete TCB, and return.  Otherwise (no ACK), drop the
            segment and return.

which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:

	/* No ACK in the segment */

	if (th->rst) {
		/* rfc793:
		 * "If the RST bit is set
		 *
		 *      Otherwise (no ACK) drop the segment and return."
		 */

		goto discard_and_undo;
	}

meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.

The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.

This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.

Note that non-SYN, non-RST segments were already covered by commit
4988e2b40631 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").

Reported-by: Dirk Janssen 
Reported-by: Roeland van de Pol 
Reported-by: Robert Floor 
Signed-off-by: Stefano Brivio

tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets

2025-01-21T13:28:37+00:00

Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.

TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.

Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.

Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.

After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):

  https://news.ycombinator.com/item?id=34180239

Suggested-by: David Gibson 
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: Buffer sizes are not inherited on accept()/accept4()

2025-01-21T13:28:14+00:00

...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.

Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.

As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).

Fixes: 904b86ade7db ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

vhost_user: remove ASSERT() on iovec number

2025-01-20T18:51:24+00:00

Replace ASSERT() on the number of iovec in the element and on
the first entry length by a debug() message.

Signed-off-by: Laurent Vivier 
[sbrivio: Fix typo in failure message]
Signed-off-by: Stefano Brivio

vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE

2025-01-20T18:51:24+00:00

Report to front-end that we support device state commands:
VHOST_USER_CHECK_DEVICE_STATE
VHOST_USER_SET_LOG_BASE

These feature is needed to transfer backend state using frontend
channel.

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command

2025-01-20T18:51:24+00:00

Set the file descriptor to use to transfer the
backend device state during migration.

Signed-off-by: Laurent Vivier 
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio

vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command

2025-01-20T18:51:24+00:00

After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.

The value returned indicates success or error;
0 is success, any non-zero value is an error.

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD

2025-01-20T18:51:24+00:00

This features allows QEMU to be migrated. We need also to report
VHOST_F_LOG_ALL.

This protocol feature reports we can log the page update and
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.

Signed-off-by: Laurent Vivier 
Signed-off-by: Stefano Brivio

passt, branch 2025_01_21.4f2c8e7

vhost_user: Drop packet with unsupported iovec array

tcp: Set PSH flag for last incoming packets in a batch

tcp: Set ACK flag on *all* RST segments, even for client in SYN-SENT state

tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets

tcp: Buffer sizes are *not* inherited on accept()/accept4()

vhost_user: remove ASSERT() on iovec number

vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE

vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command

vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command

vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD

tcp: Set ACK flag on all RST segments, even for client in SYN-SENT state

tcp: Buffer sizes are not inherited on accept()/accept4()