passt, branch 2024_11_21.238c69f

tcp: Acknowledge keep-alive segments, ignore them for the rest

2024-11-21T05:52:36+00:00

RFC 9293, 3.8.4 says:

   Implementers MAY include "keep-alives" in their TCP implementations
   (MAY-5), although this practice is not universally accepted.  Some
   TCP implementations, however, have included a keep-alive mechanism.
   To confirm that an idle connection is still active, these
   implementations send a probe segment designed to elicit a response
   from the TCP peer.  Such a segment generally contains SEG.SEQ =
   SND.NXT-1 and may or may not contain one garbage octet of data.  If
   keep-alives are included, the application MUST be able to turn them
   on or off for each TCP connection (MUST-24), and they MUST default to
   off (MUST-25).

but currently, tcp_data_from_tap() is not aware of this and will
schedule a fast re-transmit on the second keep-alive (because it's
also a duplicate ACK), ignoring the fact that the sequence number was
rewinded to SND.NXT-1.

ACK these keep-alive segments, reset the activity timeout, and ignore
them for the rest.

At some point, we could think of implementing an approximation of
keep-alive segments on outbound sockets, for example by setting
TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single
keep-alive segment at approximately the same time, and never reset the
connection. That's beyond the scope of this fix, though.

Reported-by: Tim Besard 
Link: https://github.com/containers/podman/discussions/24572
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore

2024-11-21T05:51:25+00:00

We enter the timer handler with the ACK_TO_TAP_DUE flag, call
tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we
acknowledged everything meanwhile, so we return early, but we also
need to reset that flag to avoid unnecessarily scheduling the timer
over and over again until more pending data appears.

I'm not sure if this fixes any real issue, but I've spotted this
in several logs reported by users, including one where we have some
unexpected bursts of high CPU load during TCP transfers at low rates,
from https://github.com/containers/podman/issues/23686.

Link: https://github.com/containers/podman/discussions/24572
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

ndp: Don't send unsolicited RAs if NDP is disabled

2024-11-19T20:10:42+00:00

We recently added support for sending unsolicited NDP Router Advertisement
packets.  While we (correctly) disable this if the --no-ra option is given
we incorrectly still send them if --no-ndp is set.  Fix the oversight.

Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

ndp: Don't send unsolicited router advertisement if we can't, yet

2024-11-19T20:10:14+00:00

ndp_timer() is called right away on the first epoll_wait() cycle,
when the communication channel to the guest isn't ready yet:

  1.0038: NDP: sending unsolicited RA, next in 264s
  1.0038: tap: failed to send 1 frames of 1

check that it's up before sending it. This effectively delays the
first gratuitous router advertisement, which is probably a good idea
given that we expect the guest to send a router solicitation right
away.

Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: Stefano Brivio 
Reviewed-by: David Gibson

selinux: Use auth_read_passwd() interface for all our getpwnam() needs

2024-11-19T20:10:14+00:00

If passt or pasta are started as root, we need to read the passwd file
(be it /etc/passwd or whatever sssd provides) to find out UID and GID
of 'nobody' so that we can switch to it.

Instead of a bunch of allow rules for passwd_file_t and sssd macros,
use the more convenient auth_read_passwd() interface which should
cover our usage of getpwnam().

The existing rules weren't actually enough:

  # strace -e openat passt -f
  [...]
  Started as root, will change to nobody.
  openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/lib64/libnss_sss.so.2", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 4

with corresponding SELinux warnings logged in audit.log.

Reported-by: Minxi Hou 
Analysed-by: Miloš Malik 
Signed-off-by: Stefano Brivio

ndp: Send unsolicited Router Advertisements

2024-11-14T18:00:40+00:00

Currently, our NDP implementation only sends Router Advertisements (RA)
when it receives a Router Solicitation (RS) from the guest.  However,
RFC 4861 requires that we periodically send unsolicited RAs.

Linux as a guest also requires this: it will send an RS when a link first
comes up, but the route it gets from this will have a finite lifetime (we
set this to 65535s, the maximum allowed, around 18 hours).  When that
expires the guest will not send a new RS, but instead expects the route to
have been renewed (if still valid) by an unsolicited RA.

Implement sending unsolicited RAs on a partially randomised timer, as
required by RFC 4861.  The RFC also specifies that solicited RAs should
also be delayed, or even omitted, if the next unsolicited RA is soon
enough.  For now we don't do that, always sending an immediate RA in
response to an RS.  We can get away with this because in our use cases
we expect to just have passt itself and the guest on the link, rather than
a large broadcast domain.

Link: https://github.com/kubevirt/kubevirt/issues/13191
Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

passt: Seed libc's pseudo random number generator

2024-11-14T18:00:38+00:00

We have an upcoming case where we need pseudo-random numbers to scatter
timings, but we don't need cryptographically strong random numbers.  libc's
built in random() is fine for this purpose, but we should seed it.  Extend
secret_init() - the only current user of random numbers - to do this as
well as generating the SipHash secret.  Using /dev/random for a PRNG seed
is probably overkill, but it's simple and we only do it once, so we might
as well.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

util: Add general low-level random bytes helper

2024-11-14T18:00:36+00:00

Currently secret_init() open codes getting good quality random bytes from
the OS, either via getrandom(2) or reading /dev/random.  We're going to
add at least one more place that needs random data in future, so make a
general helper for getting random bytes.  While we're there, fix a number
of minor bugs:
 - getrandom() can theoretically return a "short read", so handle that case
 - getrandom() as well as read can return a transient EINTR
 - We would attempt to read data from /dev/random if we failed to open it
   (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok)
 - More specific error reporting

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

ndp: Make route lifetime a #define

2024-11-14T18:00:34+00:00

Currently we open-code the lifetime of the route we advertise via NDP to be
65535s (the maximum).  Change it to a #define.

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio

ndp: Use struct assignment in preference to memcpy() for IPv6 addresses

2024-11-14T18:00:31+00:00

There are a number of places we can simply assign IPv6 addresses about,
rather than the current mildly ugly memcpy().

Signed-off-by: David Gibson 
Signed-off-by: Stefano Brivio