passt/tap.h, branch 2025_06_11.0293c6f

tap: Make size of pool_tap[46] purely a tuning parameter

2025-03-20T19:33:09+00:00

Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf (TAP_MSGS).  However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size.  But ETH_ZLEN is based on physical
constraints of Ethernet, which don't apply to our virtual devices.  It is
possible to generate a legitimate frame smaller than this, for example an
empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long.
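
As a rough check of the 42-byte figure, the header sizes can be added up
directly. This is only a sketch: the SKETCH_* names and udp4_frame_len()
are made up for illustration and are not part of passt, and the sizes
assume no IPv4 options and no VLAN tag.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants only, these names do not exist in passt */
enum {
	SKETCH_ETH_HLEN  = 14,	/* destination MAC, source MAC, EtherType */
	SKETCH_IPV4_HLEN = 20,	/* IPv4 header without options */
	SKETCH_UDP_HLEN  = 8,	/* ports, length, checksum */
	SKETCH_ETH_ZLEN  = 60,	/* minimum frame length on physical Ethernet */
};

/* The smallest UDP/IPv4 frame is shorter than the physical minimum */
static_assert(SKETCH_ETH_HLEN + SKETCH_IPV4_HLEN + SKETCH_UDP_HLEN <
	      SKETCH_ETH_ZLEN, "headers alone are below ETH_ZLEN");

/* Length of an unpadded UDP/IPv4 frame carrying dlen bytes of payload */
size_t udp4_frame_len(size_t dlen)
{
	return SKETCH_ETH_HLEN + SKETCH_IPV4_HLEN + SKETCH_UDP_HLEN + dlen;
}
```

With an empty payload, udp4_frame_len(0) gives 14 + 20 + 8 = 42, which is
below the 60 bytes that the TAP_MSGS calculation assumed as a minimum.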

Furthermore, the same limit applies to vhost-user, which is not limited
by the size of pkt_buf like the other backends.  In that case we don't even
have full control of the maximum buffer size, so we can't really calculate
how many packets could fit in there.

If we do exceed TAP_MSGS we'll drop packets, not just use more batches,
which is moderately bad.  The fact that this needs to be sized just so for
correctness, not merely for tuning, is a fairly non-obvious coupling
between different parts of the code.

To make this more robust, alter the tap code so it doesn't rely on
everything fitting in a single batch of TAP_MSGS packets, instead breaking
into multiple batches as necessary.  This leaves TAP_MSGS as purely a
tuning parameter, which we can freely adjust based on performance measures.
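
The general idea of splitting work into bounded batches, rather than
dropping whatever exceeds the limit, might look like the sketch below.
It is not the actual tap code: SKETCH_BATCH_MAX stands in for TAP_MSGS
with an arbitrary value, and struct batch_stats, handle_batch() and
handle_all() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH_MAX 256	/* stand-in for TAP_MSGS, arbitrary value */

/* Hypothetical bookkeeping, used here only to observe the batching */
struct batch_stats {
	size_t batches;		/* number of batches handled */
	size_t packets;		/* total packets handled */
};

/* Handle n packets starting at index first; this sketch only counts them */
void handle_batch(struct batch_stats *st, size_t first, size_t n)
{
	(void)first;
	assert(n > 0 && n <= SKETCH_BATCH_MAX);
	st->batches++;
	st->packets += n;
}

/* Handle all packets, at most SKETCH_BATCH_MAX at a time, dropping none */
void handle_all(struct batch_stats *st, size_t total)
{
	size_t done = 0;

	while (done < total) {
		size_t n = total - done;

		if (n > SKETCH_BATCH_MAX)
			n = SKETCH_BATCH_MAX;

		handle_batch(st, done, n);
		done += n;
	}
}
```

With this structure, changing the batch limit changes only how many
batches are needed, not whether all packets are handled.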

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

pcap: Correctly set snaplen based on tap backend type

2025-03-12T22:08:33+00:00

The pcap header includes a value indicating how much of each frame is
captured.  We always capture the entire frame, so we want to set this to
the maximum possible frame size.  Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.

Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen.  While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.
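
For reference, a sketch of a classic pcap file header built locally from
a maximum frame length follows. The field layout is that of the pcap file
format, but struct pcap_hdr_sketch and pcap_hdr_init() are illustrative
names, and the l2_max_len argument merely stands in for whatever
tap_l2_max_len() returns.

```c
#include <assert.h>
#include <stdint.h>

/* Classic pcap file header, 24 bytes (names are illustrative) */
struct pcap_hdr_sketch {
	uint32_t magic;		/* 0xa1b2c3d4: microsecond timestamps */
	uint16_t major;		/* file format version, major part */
	uint16_t minor;		/* file format version, minor part */
	int32_t  thiszone;	/* GMT offset, conventionally 0 */
	uint32_t sigfigs;	/* timestamp accuracy, conventionally 0 */
	uint32_t snaplen;	/* maximum number of bytes captured per frame */
	uint32_t linktype;	/* 1 means Ethernet */
};

static_assert(sizeof(struct pcap_hdr_sketch) == 24,
	      "pcap file header must be 24 bytes");

/* Build the header with snaplen set to the maximum L2 frame length */
struct pcap_hdr_sketch pcap_hdr_init(uint32_t l2_max_len)
{
	const struct pcap_hdr_sketch hdr = {
		.magic    = 0xa1b2c3d4,
		.major    = 2,
		.minor    = 4,
		.snaplen  = l2_max_len,
		.linktype = 1,
	};

	return hdr;
}
```

Because the header is only needed once, when the capture file is opened,
a local variable like this is sufficient and no global is required.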

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tap: Use explicit defines for maximum length of L2 frame

2025-03-12T22:08:33+00:00

Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame.  This define comes from the kernel, but it's badly named and
used confusingly.

First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths.  It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).

Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers.  In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers.  In tap.c we're mostly
using it in the second way.

Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.

Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.
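
The distinction between an L3 size (MTU) and an L2 frame length can be
made concrete with a small sketch. SKETCH_ETH_HLEN and l2_len_from_l3()
are illustrative names, not the new defines, and the sketch assumes a
plain 14-byte Ethernet header without VLAN tags.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ETH_HLEN 14	/* Ethernet header length, no VLAN tag */

/* Two 6-byte MAC addresses plus a 2-byte EtherType */
static_assert(SKETCH_ETH_HLEN == 6 + 6 + 2, "unexpected header length");

/* L2 frame length needed to carry an L3 datagram of l3_len bytes */
size_t l2_len_from_l3(size_t l3_len)
{
	return l3_len + SKETCH_ETH_HLEN;
}
```

For example, a 1500-byte MTU corresponds to a 1514-byte frame, and a
maximum-size 65535-byte IP datagram to a 65549-byte frame, which is why
treating one constant as both an MTU and a frame length is confusing.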

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

udp: create and send ICMPv6 to local peer when applicable

2025-03-07T01:21:24+00:00

When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMPv6 message containing
the error code ICMP6_DST_UNREACH_NOPORT, plus the IPv6 header, UDP header
and the first 1232 bytes of the original message, if any. If the sender
socket has been connected, it uses this message to issue a
"Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, without forwarding them back to the local sender, because we
cannot read the ICMP message directly into user space. As a result, the
local peer hangs, waiting for a response that never arrives.

We now fix this for IPv6 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
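
A much simplified sketch of synthesizing such a message is shown below.
It is not the code from this commit: struct icmp6_err_sketch and
icmp6_err_fill() are hypothetical, the quoted IPv6 and UDP headers are
left out for brevity, and the checksum is not computed because it needs
the IPv6 pseudo-header. Only the type and code values (1 and 4 for
"destination unreachable, port unreachable") and the 1232-byte limit
mentioned above are taken as given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ICMP6_DST_UNREACH	1	/* type: destination unreachable */
#define SKETCH_ICMP6_NOPORT		4	/* code: port unreachable */
#define SKETCH_QUOTE_MAX		1232	/* payload limit from the text */

/* Simplified message: ICMPv6 header followed by quoted payload only */
struct icmp6_err_sketch {
	uint8_t  type;
	uint8_t  code;
	uint16_t csum;		/* left at 0: needs the IPv6 pseudo-header */
	uint32_t unused;
	uint8_t  quoted[SKETCH_QUOTE_MAX];
};

/* Fill msg from the reported type and code plus the returned payload.
 * Return the number of payload bytes actually quoted.
 */
size_t icmp6_err_fill(struct icmp6_err_sketch *msg, uint8_t type,
		      uint8_t code, const void *payload, size_t len)
{
	assert(msg && (payload || !len));

	if (len > SKETCH_QUOTE_MAX)
		len = SKETCH_QUOTE_MAX;	/* truncate, never overflow */

	memset(msg, 0, sizeof(*msg));
	msg->type = type;
	msg->code = code;
	if (len)
		memcpy(msg->quoted, payload, len);

	return len;
}
```

In a real implementation the type and code would come from the extended
error structure read from the socket error queue, rather than being
passed in as literals.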

Note that for the sake of completeness, we even produce ICMP messages
for other error types and codes. We have noticed that at least
ICMP_PROT_UNREACH is propagated as an error event back to the user.

Reviewed-by: David Gibson 
Signed-off-by: Jon Maloy 
[sbrivio: fix cppcheck warning, udp_send_conn_fail_icmp6() doesn't
 modify saddr which can be declared as const]
Signed-off-by: Stefano Brivio

tap: break out building of udp header from tap_udp6_send function

2025-03-07T01:21:24+00:00

We will need to build the UDP header at other locations than in function
tap_udp6_send(), so we break that part out to a separate function.

Reviewed-by: David Gibson 
Signed-off-by: Jon Maloy 
Signed-off-by: Stefano Brivio

udp: create and send ICMPv4 to local peer when applicable

2025-03-07T01:21:19+00:00

When a local peer sends a UDP message to a non-existing port on an
existing remote host, that host will return an ICMP message containing
the error code ICMP_PORT_UNREACH, plus the header and the first eight
bytes of the original message. If the sender socket has been connected,
it uses this message to issue a "Connection Refused" event to the user.

Until now, we have only read such events from the externally facing
socket, without forwarding them back to the local sender, because we
cannot read the ICMP message directly into user space. As a result, the
local peer hangs, waiting for a response that never arrives.

We now fix this for IPv4 by recreating and forwarding a correct ICMP
message back to the internal sender. We synthesize the message based
on the information in the extended error structure, plus the returned
part of the original message body.
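
To illustrate what recreating a correct ICMP message involves, here is a
sketch that fills an ICMPv4 error and computes its checksum with the
usual Internet checksum algorithm (RFC 1071). It is not the code from
this commit: struct icmp4_err_sketch, inet_csum() and icmp4_err_fill()
are hypothetical, and the quoted part is assumed to be a 20-byte IPv4
header without options followed by eight bytes of the original datagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ICMPv4 error: 8-byte header, then quoted IP header and 8 more bytes */
struct icmp4_err_sketch {
	uint8_t type;		/* 3: destination unreachable */
	uint8_t code;		/* 3: port unreachable */
	uint8_t csum[2];	/* checksum, most significant byte first */
	uint8_t unused[4];
	uint8_t quoted[20 + 8];
};

/* Internet checksum over len bytes, as a host-order numeric value */
uint16_t inet_csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	for (; len > 1; p += 2, len -= 2)
		sum += ((uint32_t)p[0] << 8) | p[1];
	if (len)
		sum += (uint32_t)p[0] << 8;	/* pad odd byte with zero */

	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/* Fill msg and store the checksum computed over the whole message */
void icmp4_err_fill(struct icmp4_err_sketch *msg, uint8_t type,
		    uint8_t code, const uint8_t iph[20],
		    const uint8_t first8[8])
{
	uint16_t csum;

	assert(msg && iph && first8);

	memset(msg, 0, sizeof(*msg));	/* checksum field is 0 while summing */
	msg->type = type;
	msg->code = code;
	memcpy(msg->quoted, iph, 20);
	memcpy(msg->quoted + 20, first8, 8);

	csum = inet_csum(msg, sizeof(*msg));
	msg->csum[0] = csum >> 8;
	msg->csum[1] = csum & 0xff;
}
```

A receiver can verify such a message by running the same checksum over
it, including the stored checksum field: the result is zero if the
message is intact.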

Note that for the sake of completeness, we even produce ICMP messages
for other error codes. We have noticed that at least ICMP_PROT_UNREACH
is propagated as an error event back to the user.

Reviewed-by: David Gibson 
Signed-off-by: Jon Maloy 
[sbrivio: fix cppcheck warning: udp_send_conn_fail_icmp4() doesn't
 modify 'in', it can be declared as const]
Signed-off-by: Stefano Brivio

tap: break out building of udp header from tap_udp4_send function

2025-03-06T19:17:36+00:00

We will need to build the UDP header at other locations than in function
tap_udp4_send(), so we break that part out to a separate function.
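
The kind of helper this describes might look roughly as follows. This is
a sketch under assumptions, not the function added by the commit: struct
udp_hdr_sketch and udp4_hdr_fill() are hypothetical names, and the
checksum is simply left at zero, which UDP over IPv4 permits.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UDP header, all fields in network byte order */
struct udp_hdr_sketch {
	uint16_t source;
	uint16_t dest;
	uint16_t len;		/* header plus payload, in bytes */
	uint16_t check;
};

/* Fill uh for a datagram with dlen bytes of payload.
 * Return the total L4 length (header plus payload).
 */
size_t udp4_hdr_fill(struct udp_hdr_sketch *uh, uint16_t sport,
		     uint16_t dport, size_t dlen)
{
	size_t l4len = sizeof(*uh) + dlen;

	assert(uh && l4len <= UINT16_MAX);

	uh->source = htons(sport);
	uh->dest = htons(dport);
	uh->len = htons((uint16_t)l4len);
	uh->check = 0;		/* 0 means "no checksum" for UDP over IPv4 */

	return l4len;
}
```

Returning the L4 length lets each caller continue with its own IP and
L2 headers, which is what makes the helper reusable outside the send
path.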

Reviewed-by: David Gibson 
Signed-off-by: Jon Maloy 
Signed-off-by: Stefano Brivio

tcp: Send RST in response to guest packets that match no connection

2025-03-05T20:46:32+00:00

Currently, if a non-SYN TCP packet arrives which doesn't match any existing
connection, we simply ignore it.  However RFC 9293, section 3.10.7.1 says
we should respond with an RST to a non-SYN, non-RST packet that's for a
CLOSED (i.e. non-existent) connection.
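
The RFC rule for a CLOSED connection can be sketched as below. This is
an illustration of the rule, not the passt implementation: struct
seg_sketch and rst_for_closed() are hypothetical names. SYN is counted
in the segment length only for completeness, since in the case described
here SYN packets are handled separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FIN	0x01
#define SKETCH_SYN	0x02
#define SKETCH_RST	0x04
#define SKETCH_ACK	0x10

/* Minimal view of a TCP segment, host byte order */
struct seg_sketch {
	uint32_t seq;
	uint32_t ack;
	uint8_t  flags;
	uint32_t dlen;		/* payload length in bytes */
};

/* Build in *out the reset for a segment matching no connection.
 * Return false if no reply should be sent (incoming segment has RST).
 */
bool rst_for_closed(const struct seg_sketch *in, struct seg_sketch *out)
{
	assert(in && out);

	if (in->flags & SKETCH_RST)
		return false;		/* never answer a reset with a reset */

	out->dlen = 0;
	if (in->flags & SKETCH_ACK) {
		/* <SEQ=SEG.ACK><CTL=RST> */
		out->seq = in->ack;
		out->ack = 0;
		out->flags = SKETCH_RST;
	} else {
		/* <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> */
		uint32_t seg_len = in->dlen;

		seg_len += !!(in->flags & SKETCH_SYN);
		seg_len += !!(in->flags & SKETCH_FIN);

		out->seq = 0;
		out->ack = in->seq + seg_len;
		out->flags = SKETCH_RST | SKETCH_ACK;
	}

	return true;
}
```

Taking the sequence number from the incoming ACK field is what lets the
peer accept the reset as valid for its stale connection.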

This can arise in practice with migration, in cases where some error means
we have to discard a connection.  We destroy the connection with tcp_rst()
in that case, but because the guest is stopped, we may not be able to
deliver the RST packet on the tap interface immediately.  This change
ensures an RST will be sent if the guest tries to use the connection again.

A similar situation can arise if a passt/pasta instance is killed or
crashes, but is then replaced with another attached to the same guest.
This can leave the guest with stale connections that the new passt instance
isn't aware of.  It's better to send an RST so the guest knows quickly
these are broken, rather than letting them linger until they time out.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

tap: Remove unused ETH_HDR_INIT() macro

2025-02-18T07:43:18+00:00

The uses of this macro were removed in d4598e1d18ac ("udp: Use the same
buffer for the L2 header for all frames").

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

vhost-user: add vhost-user

2024-11-27T15:47:32+00:00

Add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier 
Reviewed-by: David Gibson 
[sbrivio: as suggested by lvivier, include 
 before including  as C libraries such as musl
 __UAPI_DEF_ETHHDR in  if they already have
 a definition of struct ethhdr]
Signed-off-by: Stefano Brivio